INTRODUCTION TO PROBABILITY 
AND RANDOM PROCESSES 



by 



Kenneth Baclawski 
and 

Gian-Carlo Rota 



Copyright (c) 1979 by Kenneth Baclawski and Gian-Carlo Rota.
All rights reserved. 



TABLE OF CONTENTS

List of Tables
Introduction

I. Sets, Events and Probability
    1. The Algebra of Sets
    2. The Bernoulli Sample Space
    3. The Algebra of Multisets
    4. The Concept of Probability
        Properties of Probability Measures
    5. Independent Events
    6. The Bernoulli Process
    7. Exercises
    8. Answers to Selected Exercises

II. Finite Processes
    1. The Basic Models
    2. Counting Rules and Stirling's Formula
        The First Rule of Counting
        Stirling's Formula
        The Second Rule of Counting
    3. Computing Probabilities
    4. Indistinguishability
        Fermi-Dirac Statistics: Subsets
        Bose-Einstein Statistics: Multisets
    5. Identities for Binomial and Multiset Coefficients
    6. Random Integers
    7. Exercises
    8. Answers to Selected Exercises

III. Random Variables
    1. Integer Random Variables
        A. The Bernoulli Process: tossing a coin
        B. The Bernoulli Process: random walk
        C. Independence and Joint Distributions
        D. Fluctuations of Random Walks
            First Passage Time and the Reflection Principle
            Maximum Position
        E. Expectations
        F. The Inclusion-Exclusion Principle
    2. General Random Variables
        The Concept of a Random Variable
        Integer Random Variables
        Continuous Random Variables
        Independence
        Properties of Densities and Distributions
        Joint Distribution and Joint Density
        Expectation
    3. The Uniform Process
    4. Table of Probability Distributions
    5. Exercises
    6. Answers to Selected Exercises

IV. Statistics and the Normal Distribution
    1. Variance
        Bernoulli Process
        Uniform Process
        Standardization
    2. The Bell-Shaped Curve
    3. The Central Limit Theorem
        Statistical Measurements
    4. Significance Levels
    5. Confidence Intervals
    6. The Proof of the Central Limit Theorem
    7. The Law of Large Numbers
    8. Exercises
    9. Answers to Selected Exercises

V. Conditional Probability
    1. Discrete Conditional Probability
        Law of Alternatives
        Bayes' Law
        Law of Successive Conditioning
        Independence
    2. Gaps and Runs in the Bernoulli Process
    3. Sequential Sampling
        Exchangeability
        The Polya Urn Process
    4. The Arcsine Law of Random Walks
    5. Continuous Conditional Probability
        The Continuous Law of Alternatives
    6. Conditional Densities
        The Continuous Bayes' Law
        The Continuous Law of Successive Conditioning
    7. Gaps in the Uniform Process
        Needles on a Stick
        Exchangeability of the Gaps
    8. The Algebra of Probability Distributions
        Change of Variables
        Sums of Independent Random Variables
    9. Geometric Probability
    10. Exercises
    11. Answers to Selected Exercises

VI. The Poisson Process
    1. Continuous Waiting Times
        The Exponential Distribution
        The Gamma Distribution
    2. Comparing the Bernoulli and Uniform Processes
    3. The Poisson Sample Space
        Sums of Independent Poisson Random Variables
        Physical Systems and the Poisson Process
        Gaps and Waiting Times
        The Uniform Process from the Poisson Process
    4. The Schrödinger Method
    5. Randomized and Compound Processes
        Randomized Uniform Process
        Randomized Poisson Process
        Finite Sampling Processes
    6. Reliability Theory
    7. Exercises
    8. Answers to Selected Exercises

VII. Entropy and Information
    1. Discrete Entropy
        Partitions
        Entropy
        Properties of Entropy
        Uniqueness of Entropy
    2. The Shannon Coding Theorem
    3. Continuous Entropy
        Relative Entropy
        Boltzmann Entropy
        Standard Entropy
        Summary
    4. Exercises
    5. Answers to Selected Exercises

VIII. Markov Chains
    1. The Markov Property
    2. The Ruin Problem
    3. The Graph of a Markov Chain
    4. The Markov Sample Space
    5. Steady States of Ergodic Markov Chains
        Waiting Times and the Recurrence Theorem
    6. Exercises
    7. Answers to Selected Exercises

IX. Markov Processes and Queuing Theory
    1. The Markov Property
    2. Queuing Theory
    3. The Chapman-Kolmogorov Equations
    4. Steady States and Recurrence Times
    5. Exercises
    6. Answers to Selected Exercises

X. Brownian Motion
    1. Continuous Value Markov Processes
    2. The Wiener Process
    3. Fluctuation Theory
    4. Diffusion Theory

Index

LIST OF TABLES

Probability Measures
Basic Models of Finite Sampling Processes
Table 1: Placements
Table 2: Multinomial and Binomial Coefficients
Table 3: Placements of Indistinguishable Balls into Boxes
Fluctuation Distributions
Table of Bernoulli and Uniform Distributions
Basic Properties of Variance and Standard Deviation
Table of Means and Variances
Normal Distribution Function
Table of Conditioning Laws
Table of Poisson Distributions
Table of Analogies: Bernoulli, Poisson and Uniform Processes
Table of Entropies and Maximum Entropy Distributions

INTRODUCTION 



Probability is one of the great achievements of this 
century. Like geometry, it is a way of looking at nature. 
There are many ways of approaching natural problems, many 
points of view. The geometrical point of view has been with 
us for thousands of years. The probabilistic point of view 
is another way of focusing on problems that has been successful
in many instances. The purpose of this course is to
learn to think probabilistically. Unfortunately, the only way
to learn to think probabilistically is to learn the theorems
of probability. Only later, as one has mastered the theorems,
does the probabilistic point of view begin to emerge while
the specific theorems fade in one's memory: much as the
grin on the Cheshire cat.

We begin by giving a bird's-eye view of probability by 
examining some of the great unsolved problems of probability 
theory. It's only by seeing what the unsolved problems 
are that one gets a feeling for a field. Don't expect to be 
able to understand at this point everything about the problems 
we are about to give. They are difficult and are meant to 
be just a hint of things to come. 

Pennies on a carpet. We have a rectangular carpet and an
indefinite supply of perfect pennies. What is the proba- 
bility that if we drop the pennies on the carpet at random 
no two of them will overlap? This problem is one of the 
most important problems of statistical mechanics. If we 
could answer it we would know, for example, why water boils 
at 100°C, on the basis of purely atomic computations. 
Nothing is known about this problem. 

On the other hand, the one-dimensional version of this 
problem can be solved. We shall, in fact, solve it several 
different ways. The problem here is to drop n needles of 
length h on a stick of length b at random. The probability 
that no two needles overlap is:

\[
P(\text{no overlap}) =
\begin{cases}
\left( \dfrac{b - nh}{b - h} \right)^{n} & \text{if } b \ge nh \\
0 & \text{if } b < nh
\end{cases}
\]

[Figures: pennies on a carpet; needles on a stick]

The striking difference between the difficulty of a
problem in two dimensions and that of the corresponding 
problem in one dimension is called the "dimensional barrier". 
It is an illustration of a common problem of physicists: 
problems in low dimensions are considerably easier to solve 
than their "real world" counterparts. 

The only technique that we can presently apply to this 
problem is the "method of ignorance" or the "Monte Carlo 
method": namely simulate the problem on a computer, and 
see what happens. Usually a few iterations will give a
remarkably accurate answer, when the number of coins is small.
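
As a rough sketch of the Monte Carlo method, here it is applied to the one-dimensional needles problem, where a simulated answer can be compared with an exact one. The function names, the fixed seed and the convention that each needle's left endpoint is uniform on [0, b - h] are assumptions made for this illustration; the closed form used for comparison is the one that holds under that convention.

```python
import random

def no_overlap_exact(n, h, b):
    """Closed form, assuming left endpoints are uniform on [0, b - h]."""
    if n <= 1:
        return 1.0          # a single needle cannot overlap anything
    if b <= n * h:
        return 0.0          # the needles cannot fit without overlapping
    return ((b - n * h) / (b - h)) ** n

def no_overlap_simulated(n, h, b, trials=20000, seed=0):
    """Estimate the same probability by dropping needles at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Sort the left endpoints; needles overlap when neighbours are closer than h.
        lefts = sorted(rng.uniform(0.0, b - h) for _ in range(n))
        if all(right - left >= h for left, right in zip(lefts, lefts[1:])):
            hits += 1
    return hits / trials

exact = no_overlap_exact(3, 1.0, 10.0)
estimate = no_overlap_simulated(3, 1.0, 10.0)
print(exact, estimate)
```

The two-dimensional pennies problem can be simulated in the same spirit, but there is then no exact value to compare against.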

Random walk. We consider the grid on the plane with integral
corners. A drunkard starts at the origin and walks in one of 
the four directions with equal probability 1/4 to the next 
corner. He then repeats this process at the next corner. 

It is already an interesting mathematical question to
set up this problem so that one can answer such questions
as, for example, how long it will take the drunkard to get
home. The answer is not a number but rather a probability
distribution; that is, there is a certain probability that
he will get home in 1 step, 2 steps, 3 steps, etc.

[Figure: a random walk]

So this is a typical case for which we ask a question, 
and we get an answer that is not a number but rather a 
string of numbers each with a suitable probability. We 
call this "answering a question probabilistically" or 
"using probabilistic reasoning." 
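
To make the idea of "an answer that is a distribution" concrete, the following sketch computes the probability that the drunkard first reaches home after exactly k steps. It assumes, purely for illustration, that home is a single grid corner other than the origin; the function name and the choice of exact fractions are likewise assumptions of this example.

```python
from collections import defaultdict
from fractions import Fraction

def first_arrival_distribution(home, max_steps):
    """Return {k: P(first arrival at `home` happens at step k)} for k = 1..max_steps."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    quarter = Fraction(1, 4)
    alive = {(0, 0): Fraction(1)}       # probability mass that is not yet home
    distribution = {}
    for k in range(1, max_steps + 1):
        spread = defaultdict(Fraction)
        for (x, y), p in alive.items():
            for dx, dy in moves:
                spread[(x + dx, y + dy)] += p * quarter
        # Mass landing on home stops there; everything else keeps walking.
        distribution[k] = spread.pop(home, Fraction(0))
        alive = spread
    return distribution

print(first_arrival_distribution((1, 0), 5))
```

The values do not have to add up to 1 over a finite number of steps; the remainder is the probability that the drunkard is still wandering.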

One can completely answer the above question, given 
the position and shape of the drunkard's home. This connects
with a branch of physics called potential theory.

A question that has never been answered is to find 
the probability that after n steps the drunkard has never 
retraced his steps. We call such a random walk a self-avoiding
random walk. This is related to the problem of
polymer growth in chemistry. Of course it appears that 
this is a very special, stereotyped problem, but it turns 
out that if we can solve this stereotyped problem, we can 
solve all the others by suitable coordinate changes. This 
will be the case also in problems we shall subsequently 
encounter. 
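
For small n the probability can at least be found by brute force, which also shows why the general question is hard: the work grows exponentially with n. The sketch below reads "never retraced his steps" as "never visits the same corner twice", which is an interpretation adopted for this example, and the function names are hypothetical.

```python
from fractions import Fraction

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def count_self_avoiding_walks(n):
    """Count the n-step grid walks from the origin that never revisit a corner."""
    def extend(position, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in MOVES:
            nxt = (position[0] + dx, position[1] + dy)
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total
    return extend((0, 0), {(0, 0)}, n)

def self_avoiding_probability(n):
    """All 4**n walks are equally likely, so the probability is a ratio of counts."""
    return Fraction(count_self_avoiding_walks(n), 4 ** n)

print([self_avoiding_probability(n) for n in range(1, 6)])
```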

Cluster analysis. Suppose that we have a collection of
dots arrayed in the plane or in space, much as stars in the 
sky. The individual points obey no specific physical law, 
but the whole ensemble does. The problem is to invent the 
possible physical laws that such ensembles of dots can satisfy. 

For example, how can one describe that a pattern of
dots obeys certain clustering structures? Physically it 
is sometimes quite obvious: we just look in the sky. But 
we want a purely numerical, quantitative description. This 
is the theory of stochastic point processes. 

Of course this problem is closely related to the problem 
of pattern recognition. 

Brownian Motion. We must first mention a function called
the normal density function:

\[ f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} . \]

This function occurs so often in nature as well as in probability
that one is tempted to call it the most important function
there is. It looks like this:

[Figure: graph of the normal density function]

This is the famous "bell-curve".

Now a realistic model of the path of a drunkard is one 
that wanders in a continuous path starting at the origin. 
How does one assign probabilities to paths and what does it 
mean to follow a path at random? This was done by Norbert 
Wiener who showed that if there is a straight line barrier 
in the plane and if we consider the question of where the 
drunkard first hits the barrier we get precisely the normal
density function.

[Figure: probability density of hitting each point along the barrier]

This fact enables us to compute, just as with the case of 
discrete random walks, the probability distribution for when 
the drunkard arrives home in terms of the position and shape 
of his home. This is the problem of Brownian motion. Unlike
the others, this problem has been completely solved.

Contagion or Percolation. Imagine that we have an orchard
with evenly spaced trees and that at some time some trees 
become infected. Suppose that there is a certain probability 
that a given infected tree infects one or more neighboring 
trees before the given tree dies. 

[Figure: orchard with infected tree]

One of two things can happen: either the infection 
stays among small clusters of infected trees and eventually 
dies out or the whole orchard is wiped out. One can show 
that if the probability of one infected tree infecting another
is p there is a critical probability $p_c$ such that if $p < p_c$ the
disease will die out but that if $p > p_c$ the disease will
spread forever. How does one compute $p_c$?

Noise. We consider a signal sent from a radio transmitter
to a receiver but which is perturbed by noise along the way. 
The problem of filtering out the noise is a very important 
one for electrical engineers. The whole theory of noise 
filtering consists of computations involving the normal 
density function. 

Coin Tossing. The detailed structure of the fluctuations
occurring in the tossing of a fair coin is counter-intuitive.
We imagine a game for which at each toss of a fair coin we 
win $1 if it comes up heads and we lose $1 if it comes up 
tails. If we graph our net winnings in time we see that it
can cross the time axis if we switch from a net gain to a
net loss or vice versa. If after a period of time we find
that we have a net gain of zero, what is the most probable
number of times we crossed over the axis along the way?
The answer is that the most probable case is no times at all!

[Figure: net winnings]

In effect one can interpret this as saying that during 
a long betting session the most probable occurrence is to 
have either a winning streak or a losing streak. Frequent 
changes from one to the other are actually unlikely. 
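
For a short game this claim can be checked by listing every possible sequence of tosses. The sketch below is only an illustration: it counts a crossing each time the net winnings change from strictly positive to strictly negative or back (touching zero without changing sign is not counted), and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def crossing_counts(num_tosses):
    """Tally axis crossings over all toss sequences that end with net gain zero."""
    tally = Counter()
    for steps in product((+1, -1), repeat=num_tosses):
        if sum(steps) != 0:
            continue                      # keep only games ending at zero
        position, last_sign, crossings = 0, 0, 0
        for step in steps:
            position += step
            if position != 0:
                sign = 1 if position > 0 else -1
                if last_sign != 0 and sign != last_sign:
                    crossings += 1        # went from gain to loss or vice versa
                last_sign = sign
        tally[crossings] += 1
    return tally

print(crossing_counts(6))
```

Since every sequence is equally likely, the crossing count with the largest tally is the most probable one.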

Cell Growth. How does living tissue grow? We consider
a stereotyped case. We start with a little square and then 
imagine that with some probability the square produces a new 
square on one of its four sides. The growth proceeds on the 
boundary by a simple model. What is the pattern that such 
growth will produce? What is the probability that the tissue 
will enclose an island? 

[Figure: panels (a), (b), (c)]

* * *

The problems described above are just a sampling of the 
many interesting unsolved problems of probability. Perhaps 
you will be the one to solve one of them.



Chapter I Sets, Events and Probability 

Suppose that we toss a coin any number of times and 
that we list the information of whether we got heads or 
tails on the successive tosses: 

H T H H T T T 

1 2 3 4 5 6 7 . 

The act of tossing this coin over and over again as we have 
done is an example of an experiment, and the sequence of
H's and T's that we have listed is called its experimental
outcome. We now ask what it means, in the context of our
experiment, to say that we got a head on the fourth toss.
We call this an event. While it is intuitively obvious what
an event represents, we want to find a precise meaning for 
this concept. One way to do this which is both obvious 
and subtle is to identify an event with the set of all ways 
that the event can occur. For example, the event "the fourth 
toss is a head" is the same as the set of all sequences of 
H's and T's whose fourth entry is H. At first it appears 
that we have said little, but in fact we have made a conceptual 
advance. We have made the intuitive notion of an event into 
a concrete notion: events are a certain kind of set.

However a warning is appropriate here. An event is not 
the same concept as that of an experimental outcome. An 
outcome consists of the total information about the experiment 
after it has been performed. Thus while an event may be easy
to describe, the set to which it corresponds consists of a 
great many possible experimental outcomes, each being quite 
complex. In order to distinguish the concept of an event 
from the concept of an experimental outcome we will employ 
an artificial term for the latter. We will call it a 
sample point. Now a sample point will seldom look like an
actual point in the geometric sense. We use the word "point" 
to suggest the "indivisibility" of one given experimental 
outcome, in contrast to an event which is made up of a great 
many possible outcomes. The term "sample" is suggestive of 
the random nature of our experiment, where one particular 
sample point is only one of many possible outcomes. 

We will begin with a review of the theory of sets, with 
which we assume some familiarity. We will then extend the 
concept of a set by allowing elements to occur more than 
just once. We call such an entity a multiset. By one
more conceptual step, the notion of a probability measure
emerges as an abstraction derived from the multiset concept.
Along the way we will repeatedly return to our coin-tossing
experiment. We do this not only because it is a good example 
but also because we wish to emphasize that probability 
deals with very special kinds of sets. 

1. The Algebra of Sets

In probability we always work within a context, which
we define by specifying the set of all possible experimental
outcomes or equivalently all possible sample points. We
call this set the sample space, typically denoted $\Omega$.
The term "sample space" does not help one to visualize $\Omega$
any more than "sample point" is suggestive of an experimental
outcome. But this is the term that has become standard.
Think of $\Omega$ as the "context" or "universe of discourse".
It does not, however, in itself define our experiment.
Quite a bit more will be required to do this. One such
requirement is that we must specify which subsets of $\Omega$
are to be the events of our experiment. In general not every
subset will be an event. The choice of subsets which are
to be the events will depend on the phenomena to be studied.

We will specify the events of our experiment by specifying 
certain very simple events which we will call the "elementary 
events", which we then combine to form more complicated events. 
The ways we combine events to form other events are called
the Boolean or logical operations. The most important of
these are the following:

    union          $A \cup B$ is the set of elements either in A
                   or in B (or both).
    intersection   $A \cap B$ is the set of elements both in A and
                   in B.
    complement     $\overline{A}$ is the set of elements not in A.

Each of these has a natural interpretation in terms of
events. Let A and B be two events.

    $A \cup B$ is the event "either A or B (or both) occur"
    $A \cap B$ is the event "both A and B occur"
    $\overline{A}$ is the event "A does not occur"

Several other Boolean operations are defined in the exercises. 

When two events A and B have the property that if
A occurs then B does also (but not necessarily vice versa),
we will say that A is a subevent of B and will write
$A \subset B$. In set-theoretic language one would say that A is a
subset of B or that B contains A.

The three Boolean operations and the subevent relation 
satisfy a number of laws such as commutativity, associativity,
distributivity and so on, which we will not discuss in detail, 
although some are considered in the exercises. For example 
the De Morgan laws are the following:

\[ \overline{A \cap B} = \overline{A} \cup \overline{B} \]

\[ \overline{A \cup B} = \overline{A} \cap \overline{B} . \]

In terms of events, the first of these says that if it is not 
true that both A and B occur, then either A does not 
occur or B does not occur (or both), and conversely. One 
has a similar statement for the second law. Generally 
speaking, drawing a Venn diagram suffices to prove anything 
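about these operations. For example, here is the Venn
diagram proof of the first De Morgan law. First draw the
two events A and B:

[Figure: Venn diagram of the events A and B]

If we shade in the event $A \cap B$:

[Figure: Venn diagram with $A \cap B$ shaded]

then the event $\overline{A \cap B}$ consists of the shaded portion of the
following diagram:

[Figure: Venn diagram with $\overline{A \cap B}$ shaded]

Next shade in the events $\overline{A}$ and $\overline{B}$, respectively:

[Figure: Venn diagrams with $\overline{A}$ and $\overline{B}$ shaded]

Combining these gives us the event $\overline{A} \cup \overline{B}$:

[Figure: Venn diagram with $\overline{A} \cup \overline{B}$ shaded]

If we compare this with the event $\overline{A \cap B}$ we see that

\[ \overline{A \cap B} = \overline{A} \cup \overline{B} . \]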
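
The same check can be carried out mechanically once events are represented as sets of sample points. The sketch below uses a finite stand-in for the coin-tossing experiment (all sequences of four tosses); the names `OMEGA`, `heads_on` and `complement` are hypothetical and exist only for this illustration.

```python
from itertools import product

# Finite stand-in for the sample space: all sequences of 4 tosses.
OMEGA = frozenset(product("HT", repeat=4))

def heads_on(n):
    """The event 'the nth toss is a head' as a set of sample points (n counts from 1)."""
    return frozenset(w for w in OMEGA if w[n - 1] == "H")

def complement(event):
    """The event 'event does not occur'."""
    return OMEGA - event

A = heads_on(4)
B = heads_on(1)

# Each De Morgan law is checked by comparing its two sides as sets.
first_law_holds = complement(A & B) == complement(A) | complement(B)
second_law_holds = complement(A | B) == complement(A) & complement(B)
print(first_law_holds, second_law_holds)
```

Note that the easily described event A already corresponds to a set of eight sample points, even in this small sample space.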

For more complicated expressions, involving many sets, 
and for which the Venn diagram would be extremely complex, it
is very useful to know that there is a way we can simplify
such an expression into an essentially unique expression.
The idea is that every Boolean expression is a union of the
smallest subevents obtainable by intersecting events occurring
in the expression. To be more precise suppose that we have
an expression involving the events $A_1, A_2, \ldots, A_n$ and
unions, intersections and complements in any order nested as 
deeply as required. The simplified expression we obtain can 
be described in two steps. 



Step 1. Write $A^{+1} = A$ and $A^{-1} = \overline{A}$. This is just a
notational convenience; it has no metaphysical significance.
The expressions

\[ A_1^{i_1} \cap A_2^{i_2} \cap \cdots \cap A_n^{i_n} , \]

as $i_1, i_2, \ldots, i_n$ take on all possible choices of $\pm 1$, are
the smallest events obtainable from the events $A_1, A_2, \ldots, A_n$
using Boolean operations. We call these events the atoms
defined by $A_1, A_2, \ldots, A_n$. Notice that in general there can
be $2^n$ of them, but that in particular cases some of the atoms may
be empty so there could be fewer than $2^n$ in all.

Step 2. Any expression involving $A_1, A_2, \ldots, A_n$ and
using Boolean operations can be written as a union of
certain of the atoms. There are many procedures that can
be used to determine which of the atoms are to be used. 
We leave it as an exercise to describe such a procedure. 
The resulting expression will be called the "atomic 
decomposition". By using Venn diagrams and atomic decompositions,
any problem involving a finite number of events
can be analyzed in a straightforward way. Unfortunately
many of our problems will involve infinitely many events and 
for these we will later need some new ideas. 
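
For a finite sample space the two steps can be sketched directly. This is one possible procedure rather than the one intended by the exercise: it forms every sign pattern, keeps the nonempty atoms, and then selects the atoms contained in a given event. The function names and the small example sets are assumptions made for the illustration.

```python
from itertools import product

def atoms(universe, events):
    """Map each sign pattern (i_1, ..., i_n) to its atom, omitting empty atoms."""
    result = {}
    for signs in product((+1, -1), repeat=len(events)):
        atom = set(universe)
        for sign, event in zip(signs, events):
            # +1 keeps the event itself, -1 keeps its complement.
            atom &= event if sign == +1 else (universe - event)
        if atom:
            result[signs] = frozenset(atom)
    return result

def atomic_decomposition(event, atom_map):
    """Sign patterns of the atoms whose union is the given Boolean combination."""
    return [signs for signs, atom in atom_map.items() if atom <= event]

universe = set(range(8))
A1, A2 = {0, 1, 2, 3}, {2, 3, 4, 5}
atom_map = atoms(universe, [A1, A2])
print(atomic_decomposition(A1 | A2, atom_map))
```

The selection step relies on the fact stated in Step 2: an atom is either wholly inside a Boolean combination of the given events or disjoint from it.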

2. The Bernoulli Sample Space

We now return to the example that began this chapter: 
tossing a coin. A sample point for this experiment is an 
infinite sequence of ones and zeroes or equivalently of H's 
and T's. Just for variety we will also sometimes refer to a 
toss of heads as a "success" and tails as a "failure". Even 
if we are only concerned with a finite sequence of H's and 
T's, which is seemingly more realistic, it is nevertheless 
easier for computational reasons to imagine that we could go 
on tossing the coin forever. Moreover, we will find that 
certain seemingly rather ordinary events can only be 
expressed in such a context. 

The set of all possible sequences of ones and zeroes 
is called the Bernoulli sample space $\Omega$ and the experimental
process of which it forms the basis is called the Bernoulli
process. For the moment we will be a little vague about what a
"process" means, but we will make it precise later. The
events of the Bernoulli sample space consist of certain
subsets of $\Omega$. To describe which subsets these are, we
first describe some very simple events called elementary
events. They are the events "the first toss comes up heads",
"the second toss comes up heads", etc. We will write 
$H_n$ for the event "the nth toss comes up heads". The complement
of the event $H_n$ will be denoted $T_n = \overline{H_n}$; it is the
event "the nth toss comes up tails." One must be careful
here. It is obvious that the complementary event to "the
nth toss is heads" is "the nth toss is tails." However, it
is less obvious that as sets $\overline{H_n} = T_n$, since both $H_n$ and
$T_n$ are infinite sets and it is not easy to imagine what
they "look like". As a rule it is much easier to think in
terms of the events themselves rather than in terms of their
representations as sets.

We can now describe in general what it means for a subset 
of $\Omega$ to be an event of the Bernoulli sample space: an event
is any subset of $\Omega$ obtainable from the elementary events by
using the operations of complement as well as of unions and
intersections of infinite sequences of events. The fact that
we allow infinite unions and intersections will take some
getting used to. What we are saying is that we allow any
statements about the Bernoulli process which may in principle
be expressed in terms of tosses of heads and tails (elementary
events) using the words "and" (intersection), "or" (union),
"not" (complement) and "ever" (infinite sequences).

To illustrate this we consider the following example of 
a Bernoulli event: "a sequence of two successive H's occurs 
before a sequence of two T's ever occurs." We will call a 
specified finite sequence of H's and T's a run. So the
event in question is "the run HH occurs before the run TT
ever occurs". Write A for this event. The essence of the
event A is that we will continue flipping that coin until 
either an HH or a TT occurs. When one of them happens 
we may then quit, or we may not; but it is nevertheless 
computationally easier to conceive of the experiment as 
having continued forever. Now it ought to be conceptually 
clear that it is possible to express A in terms of 
elementary Bernoulli events, but at first it may seem 
mysterious how to do it. The idea is to break apart A 
into simpler events which can each be expressed relatively 
easily in terms of elementary events. The whole art of 
probability is to make a judicious choice of a manner of 
breaking up the event being considered. In this case we 
break up the event A according to when the first run 
of HH occurs. Let $A_n$ be the event "the run HH occurs
first at the nth toss and the run TT has not yet occurred."
The event A is the (infinite) union of all the $A_n$'s, and
in turn each $A_n$ can be expressed in terms of the elementary
events as follows:

    n    $A_n$                                                              sample points
    2    $H_1 \cap H_2$                                                     HH...
    3    $\overline{H_1} \cap H_2 \cap H_3$                                 THH...
    4    $H_1 \cap \overline{H_2} \cap H_3 \cap H_4$                        HTHH...
    5    $\overline{H_1} \cap H_2 \cap \overline{H_3} \cap H_4 \cap H_5$    THTHH...
    etc.

Note that not only is A the union of the $A_n$'s but also
none of the $A_n$'s overlap with any other. In other words no
sample point of A has been counted twice. This property will 
be very important for probabilistic computations. 
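
The decomposition can be checked by brute force on a truncated experiment. The sketch below looks only at the first few tosses (a finite stand-in for the infinite sequences of the Bernoulli sample space) and builds each event as a set of sequences; the helper names `first_run` and `event_A` are hypothetical.

```python
from itertools import product

def first_run(seq, run):
    """Toss number (counting from 1) at which `run` is first completed, or None."""
    start = "".join(seq).find(run)
    return None if start < 0 else start + len(run)

def event_A(n, length):
    """Sequences of the given length in which HH is first completed at toss n, before any TT."""
    members = set()
    for seq in product("HT", repeat=length):
        hh, tt = first_run(seq, "HH"), first_run(seq, "TT")
        if hh == n and (tt is None or tt > n):
            members.add(seq)
    return members

pieces = {n: event_A(n, 5) for n in range(2, 6)}
for n, members in pieces.items():
    # Every member should share one common prefix of length n.
    print(n, sorted({"".join(seq[:n]) for seq in members}))
```

Each piece is pinned down by a single prefix (HH, THH, HTHH, THTHH), and since the prefixes are incompatible with one another the pieces cannot overlap.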

As an exercise one might try to calculate the expression,
in terms of elementary events, of the event "a run of k heads
occurs before a run of n tails occurs." Later we will develop
tools for computing the probability of such an event quite easily,
and this exercise will quickly convince one of the power of these
tools.

3. The Algebra of Multisets 

We now go back to our study of set theory. Our objective 
is to extend the concept of a set by allowing elements of sets 
to be repeated. This more general concept is called a multiset.
To give an example, suppose that a committee of 10 members has an 
election to determine its chairperson. Of the votes that are 
cast, 7 are for candidate A, 2 for B and 1 for C. The set of 
votes is most easily expressed as a multiset consisting of 10 
elements: 7 of type A, 2 of type B and 1 of type C. In set- 
builder notation we write this {a,a,a,a,a,a,a,b,b,c}. We can
write this more economically as $\{a^7, b^2, c^1\}$, the exponents
denoting the number of copies of the element that are in the 
multiset. Notice that a set is a special kind of multiset. 

As with sets, we can combine multisets to form new
multisets. In some ways these operations are more natural
than the analogous ones for sets. The operations are addition
and multiplication. In the exercises we describe one more
operation: subtraction. Given two multisets M and N,
their sum M+N is obtained by combining all the elements of
M and N, counting multiplicities. For example if a occurs
three times in M and twice in N, then it occurs five times
in M+N. The product MN of M and N is obtained by
multiplying the multiplicities of elements occurring in both
M and N. For example if a occurs three times in M and
twice in N, then it occurs six times in MN. Here are some
more examples:
more examples: 

    {a,a,a,b,b} + {a,b,b,b,c} = {a,a,a,a,b,b,b,b,b,c}
    {a,a,a,b,b} • {a,b,b,b,c} = {a,a,a,b,b,b,b,b,b}
    {a,b} + {b,c} = {a,b,b,c}
    {a,b} • {b,c} = {b}

or using exponent notation:

    $\{a^3, b^2\} + \{a^1, b^3, c^1\} = \{a^4, b^5, c^1\}$
    $\{a^3, b^2\} \cdot \{a^1, b^3, c^1\} = \{a^3, b^6\}$
    $\{a^1, b^1\} + \{b^1, c^1\} = \{a^1, b^2, c^1\}$
    $\{a^1, b^1\} \cdot \{b^1, c^1\} = \{b^1\}$
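
These operations are easy to try out on a computer. In the sketch below a multiset is modelled by Python's `collections.Counter`, which records a multiplicity for each element; this modelling choice and the two function names are assumptions of the example, not notation from the text.

```python
from collections import Counter

def multiset_sum(m, n):
    """M + N: multiplicities add."""
    return m + n

def multiset_product(m, n):
    """MN: multiplicities multiply; elements missing from either factor drop out."""
    return Counter({x: m[x] * n[x] for x in m if m[x] * n[x] > 0})

M = Counter("aaabb")    # {a^3, b^2}
N = Counter("abbbc")    # {a^1, b^3, c^1}

print(multiset_sum(M, N))
print(multiset_product(M, N))
```

When both arguments happen to be sets (all multiplicities equal to 1), `multiset_product` returns their intersection.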

When A and B are two sets, it now makes sense to
speak of their sum A+B and their product AB. What do
these mean in terms of sets? The product is easy to describe:
it coincides precisely with the intersection $A \cap B$. For this
reason it is quite common to write AB for the intersection
of two events. On the other hand, the sum of two sets is not
so easy to describe. In general A+B will not be a set even
when both A and B are. The reason is that those elements
occurring both in A and in B will necessarily occur twice
in A+B. However if A and B are disjoint, that is when
$A \cap B$ is empty, then A+B is a set and coincides with $A \cup B$.
As this situation is quite important in probability, we will
often write A+B to denote the union of A and B when
A is disjoint from B, and we will then refer to A+B as
the disjoint union of A and B.



4. The Concept of Probability 

Consider once again the election multiset introduced in the 
last section: $\{a^7, b^2, c^1\}$. What percentage of the votes did each
of the candidates receive? An easy calculation reveals that A 
received 70% of the total, B received 20% and C received only 
10%. The process of converting "raw counts" into percentages 
loses some of the information of the original multiset, since the
percentages do not reveal how many votes were cast. However, the 
percentages do contain all the information relevant to an election. 
By taking percentages we are replacing the complete information
of the number of votes cast for each candidate by the information 
of what proportion of votes were cast for each candidate relative 
to the total number of votes cast. The concept of probability is 
an abstraction derived from this situation. Namely, a probability 
measure on a set tells one the proportion or size of an element 
or a subset relative to the size of the set as a whole. We may 
intuitively think of a probability as an assignment of a non- 
negative real number to every element of the set in such a way
that the sum of all such numbers is 1. The above multiset $\{a^7, b^2, c^1\}$
gives rise to a probability measure which will be denoted in the
following manner. For every subset S of {a,b,c}, we write P(S)
for the proportion of the multiset $\{a^7, b^2, c^1\}$ which has elements
from S. For example, P({a}) is 0.7 because 70% of the elements
of $\{a^7, b^2, c^1\}$ are a's. Similarly, P({a,b}) is 0.9, P({b,c})
is 0.3, P({a,b,c}) is 1.0 and so on. We call P a probability
measure. It is important to realize that P is defined not on
elements but on subsets. We do this because we observed that events
are subsets of the sample space, and we wish to express the concept 
of a probability directly in terms of events. As we have seen it 
is easier to think directly in terms of events rather than in terms 
of sets of outcomes. For this reason we henceforth decree that a 
probability measure P on a sample space $\Omega$ is a function which
assigns a real number P(A) to every event A of $\Omega$ such that

    (1) $P(A) \ge 0$

    (2) $P(\Omega) = 1$

    (3) If $A_1, A_2, \ldots$ is a sequence of disjoint events, then
        $P(A_1 + A_2 + \cdots) = P(A_1) + P(A_2) + \cdots$, or more compactly:

        \[ P\Big( \sum_{n=1}^{\infty} A_n \Big) = \sum_{n=1}^{\infty} P(A_n) . \]

At first it may not be easy to see that these three conditions 
capture the concept of "proportion" we described above. The first 
two conditions however are easy to understand: we do not allow 
outcomes to occur a negative number of times, and the measure of 
Ω itself is 1 because it is the totality of all possible outcomes.
It is the third condition that is the most difficult to justify.
This condition is called countable additivity. When the sequence
of events consists of just two events A and B, it is obvious.
Let C be the union A ∪ B. Since A and B are assumed to be
disjoint, C is the same as A + B. Probabilistically this says
that A and B are mutually exclusive alternatives for C: it
occurs if and only if exactly one of A or B occurs. Clearly
if this is so then the probability of C is "distributed" between
A and B, i.e. P(C) = P(A) + P(B). The extension of this rule
to an infinite sequence of events is somewhat unintuitive, but one 
can get used to it when one sees concrete examples. 
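
As one such concrete example, the three conditions can be checked mechanically on the multiset {a^7, b^2, c^1} used above. The Python sketch below is only an illustration: the names `multiset` and `prob` are ours, not part of the text, and since the sample space is finite only the finite case of condition (3) can be tested.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# The multiset {a^7, b^2, c^1}: element -> multiplicity.
multiset = Counter({"a": 7, "b": 2, "c": 1})
total = sum(multiset.values())

def prob(subset):
    """P(S): proportion of the multiset made up of elements from S."""
    return Fraction(sum(multiset[x] for x in subset), total)

# All subsets (events) of the sample space {a, b, c}.
elements = sorted(multiset)
events = [frozenset(c) for r in range(len(elements) + 1)
          for c in combinations(elements, r)]

# Condition (1): non-negativity.
nonnegative = all(prob(e) >= 0 for e in events)
# Condition (2): the whole sample space has probability 1.
whole_space_is_one = prob(elements) == 1
# Condition (3), finite case: additivity on every pair of disjoint events.
additive = all(prob(a | b) == prob(a) + prob(b)
               for a in events for b in events if not (a & b))
```

Exact fractions are used rather than floating-point numbers so that the equalities in conditions (2) and (3) can be tested without rounding error.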

Properties of Probability Measures 

We now show three important facts about probability measures. 
These facts relate the concept of probability to the Boolean concepts 
of subevent, union, intersection and complement. 

Subevents. If A is a subevent of B, then P(A) ≤ P(B).
Although this should be intuitively clear, we will prove it from
the three conditions for P to be a probability. First observe
that A ⊆ B means that B is the disjoint union of A and B\A,
where B\A denotes B ∩ Ā. This should be clear from the Venn
diagram, or just think of what it says: every element of B is
either in A or it is not, and these alternatives are mutually
exclusive.

[Venn diagram: B\A is shaded]

Therefore condition (3) implies that

P(B) = P(A + (B\A)) = P(A) + P(B\A).

By condition (1), P(B\A) ≥ 0. Therefore,

P(B) = P(A) + P(B\A) ≥ P(A).

As a consequence we find that since every event A is a subevent
of Ω,

0 ≤ P(A) ≤ P(Ω) = 1.
This corresponds to our intuitive feeling that probability is a 
measure of likelihood, ranging from extremely unlikely (zero or 
near zero) to extremely likely (1 or close to 1) . 



Union and Intersection. If A and B are two events, then

P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

To prove this we first write A ∪ B as a disjoint union of atoms.
From the Venn diagram it is clear that

A ∪ B = (A ∩ B) + (A\B) + (B\A).

[Venn diagram of A and B]

Similarly, we can write A and B as (disjoint) unions of atoms:

A = (A ∩ B) + (A\B)
B = (A ∩ B) + (B\A).

By condition (3),

P(A ∪ B) = P(A ∩ B) + P(A\B) + P(B\A)

P(A) = P(A ∩ B) + P(A\B)

P(B) = P(A ∩ B) + P(B\A).

Now solve for P(A\B) and P(B\A) in the last two expressions
and substitute these into the first. This gives our formula.
The usefulness of this formula is that it applies even when A 
and B are not disjoint. 
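
The formula and the atom decomposition behind it can be checked on a small case. The sketch below uses a hypothetical example that does not appear in the text: one roll of a fair die, with two overlapping events A and B of our own choosing.

```python
from fractions import Fraction

# Hypothetical example (not from the text): one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Uniform probability of an event, i.e. of a subset of omega."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is at least 4"

# Left and right sides of P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)

# The three atoms A ∩ B, A\B and B\A are disjoint and make up A ∪ B.
atom_sum = prob(A & B) + prob(A - B) + prob(B - A)
```

Here A and B share the outcomes 4 and 6, so simply adding P(A) and P(B) would count those outcomes twice; subtracting P(A ∩ B) corrects for this.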

Here is a concrete example. Suppose that we have two coins. 
Let A be the event that the first shows heads, and let a = P(A) 
be the probability of this. Similarly let B be the event that 
the second shows heads, and let b be P(B) . What is the 
probability that when we toss both of them at least one shows 
heads? Clearly we want P(A ∪ B). By the above formula, we find
that P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = a + b - P(A ∩ B).
However, we do not yet know how to compute P(A ∩ B) in terms of
P(A) and P(B). We will return to this problem in the next
section.

Complement. If A is an event, then P(Ā) = 1 - P(A).
To see this simply note that Ω is the disjoint union of A and Ā.
By conditions (2) and (3), we have 1 = P(Ω) = P(A + Ā) = P(A) + P(Ā).
Thus we see that the probability for an event not to occur is
"complementary" to the probability for its occurrence. For example,
if the probability of getting heads when we toss a coin is p , then 
the probability of getting tails is q = 1 - p.

5. Independent Events

The notion of independence is an intuitive one derived from 
experience: two events are independent if they have no effect on 
one another. More precisely if we have two independent events A 
and B, then knowing A has occurred does not change the probabil- 
ity for B to occur and vice versa. When we have the notion of 
conditional probability we can make this statement completely 
rigorous. Nevertheless even with the terminology we have so far, 
the concept of independence is easy to express. We say two events 
A and B are independent when 

P(A ∩ B) = P(A)P(B).

If we use multiset notation, writing AB for A ∩ B, then this
rule is very suggestive: P(AB) = P(A)P(B). It is important to
realize that only independent events satisfy this rule just as
only disjoint events satisfy additivity: P(A + B) = P(A) + P(B).

Consider the case of coin tossing. The individual tosses of 
the coin are independent: the coin is the same coin after each 
toss and has no memory of having been tossed before. As a result,
the probability of getting two heads in two tosses is the square
of the probability of getting one head on one toss.

As an application consider the two-coin toss problem in the
last section. Since we are tossing two different coins, it seems
reasonable to expect A and B to be independent. Therefore
P(A ∩ B) = P(A)P(B) = ab. Thus

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
         = a + b - ab.

We conclude that the probability for at least one of the coins to
show heads is a + b - ab.
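
The value a + b - ab can also be obtained by listing the four outcomes of the two tosses and adding up the probabilities of those in which a head appears. The following Python sketch does this; the function name and the particular biases 1/2 and 1/3 are our own illustrative choices, not values from the text.

```python
from fractions import Fraction
from itertools import product

def at_least_one_head(a, b):
    """P(A ∪ B) for two independent coins with P(heads) = a and b."""
    total = Fraction(0)
    for first, second in product("HT", repeat=2):
        p_first = a if first == "H" else 1 - a
        p_second = b if second == "H" else 1 - b
        if "H" in (first, second):
            # Independence: the probability of the pair is the product.
            total += p_first * p_second
    return total

# Illustrative biases, chosen arbitrarily.
a, b = Fraction(1, 2), Fraction(1, 3)
by_enumeration = at_least_one_head(a, b)
by_formula = a + b - a * b
```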

For any three events A, B and C, we say these events are
independent when:

(1) any pair of the three are independent,

(2) P(A ∩ B ∩ C) = P(A)P(B)P(C).

It is possible for three events to satisfy (1) but not (2). This 
is an important point that is easily missed. Consider again the 
two-coin toss problem above, and suppose that both coins are
fair. Let C be the event that the two coins show different
faces (one heads, the other tails). Then
A, B and C are pairwise independent; for example, knowing that
the first coin shows heads tells one nothing about whether the 
other will be the same or different. However the three events 
are not independent: the occurrence of any two of them precludes 
the third from occurring. 
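
This example is small enough to verify by enumeration. The sketch below assumes two fair coins, so that each of the four outcomes has probability 1/4; the helper names `prob` and `both` are ours.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coins; each outcome has probability 1/4.
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a true/false test on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def A(w):
    return w[0] == "H"          # first coin shows heads

def B(w):
    return w[1] == "H"          # second coin shows heads

def C(w):
    return w[0] != w[1]         # the coins show different faces

def both(e, f):
    """The intersection of two events."""
    return lambda w: e(w) and f(w)

# Condition (1): every pair satisfies the product rule.
pairwise = all(prob(both(e, f)) == prob(e) * prob(f)
               for e, f in [(A, B), (A, C), (B, C)])
# Condition (2) fails: the triple intersection is empty.
triple = prob(lambda w: A(w) and B(w) and C(w))
product_of_three = prob(A) * prob(B) * prob(C)
```

The triple intersection has probability 0 while the product of the three probabilities is 1/8, so condition (2) does not hold even though condition (1) does.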

Similarly given any number of events (even an infinite number) 
we say that they are independent when 

P(A_1 ∩ A_2 ∩ ... ∩ A_n) = P(A_1)P(A_2) ... P(A_n)

for any finite subcollection A_1, A_2, ..., A_n of the events.



6. The Bernoulli Process

This is the process of tossing a biased coin. In a given 
toss we suppose that the probability is p for heads and q for 
tails, where p+q = 1. Generally speaking we will also be 
implicitly assuming that both p and q are nonzero, but other 
than this we shall make no restriction on what value p could have. 
We call p the bias of the coin. A fair coin is a special kind 
of biased coin; namely, one with bias p = 1/2.

We want to assign a probability to every elementary event and 
show how this allows us to compute the probability of every other 
event. This is done in two steps.

Step 1. P(H_n) = p and P(T_n) = P(H̄_n) = q = 1 - p. This assignment
is made irrespective of n. In effect we assume that we are
using the same coin (or at least one with the same bias) during each
toss.

We have now defined the probability of the elementary events.
But we still cannot determine the probability of an arbitrary event
because none of the three conditions determines what, for example,
P(H_1 ∩ H_2) can be, although they limit the possible choices.
This leads us to our second assumption:

Step 2. P(H_1^{i_1} ∩ H_2^{i_2} ∩ ... ∩ H_n^{i_n}) = P(H_1^{i_1}) P(H_2^{i_2}) ... P(H_n^{i_n}), where
i_1, i_2, ..., i_n take on all possible choices of ±1.

Here we have drawn from the physical nature of the phenomenon of 
tossing a coin. The question of whether tosses of a real coin are 
independent is another question entirely: the question of the
validity of the model as a means of describing the actual physical
experiment. We will not consider this question until Chapter IV. 
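
Taken together, Steps 1 and 2 give the probability of any prescribed pattern of heads and tails on the first n tosses: multiply one factor p for each head and one factor q for each tail. A minimal Python sketch follows; the function name and the bias 1/3 are our own illustrative choices.

```python
from fractions import Fraction

def sequence_probability(outcomes, p):
    """Probability that the first tosses show exactly the given H/T pattern."""
    q = 1 - p
    result = 1
    for toss in outcomes:
        # Step 1 gives each factor; Step 2 says the factors multiply.
        result *= p if toss == "H" else q
    return result

p = Fraction(1, 3)                      # an arbitrary illustrative bias
hh = sequence_probability("HH", p)      # P(H_1 ∩ H_2) = p^2
thh = sequence_probability("THH", p)    # P(T_1 ∩ H_2 ∩ H_3) = q p^2
```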

For any other event A of the Bernoulli process, the 
probability is calculated by expanding A in terms of the elementary 
events (possibly using countable unions) and by using conditions 
(1), (2) and (3). It is not always obvious what the best way to 
do this might be. There is an "art" to this part of the subject, 
and developing one's skill in this art is the whole idea of 
learning probability. 

Let us return to the event: "a run of HH occurs before
a run of TT." Recall that this event can be expanded as a disjoint
union:

A = (H_1 ∩ H_2) + (T_1 ∩ H_2 ∩ H_3) + (H_1 ∩ T_2 ∩ H_3 ∩ H_4) + ...

By countable additivity, we compute:

P(A) = P(H_1 ∩ H_2) + P(T_1 ∩ H_2 ∩ H_3) + P(H_1 ∩ T_2 ∩ H_3 ∩ H_4) + ...

     = P(H_1)P(H_2) + P(T_1)P(H_2)P(H_3) + P(H_1)P(T_2)P(H_3)P(H_4) + ...

     = p^2 + qp^2 + pqp^2 + qpqp^2 + ...

     = (p^2 + pqp^2 + ...) + (qp^2 + qpqp^2 + ...)

     = p^2 (1 + pq + (pq)^2 + ...) + qp^2 (1 + pq + (pq)^2 + ...)

     = p^2/(1 - pq) + qp^2/(1 - pq)

     = (p^2 + qp^2)/(1 - pq).

We are assuming here that we know how to sum a geometric series:

1 + r + r^2 + r^3 + ... = 1/(1 - r), if |r| < 1.

We can check our computation of P (A) by a simple expedient: 
suppose that the coin is fair , i.e. p = q = 1/2 . In this case 
P(A) = 1/2, for either HH occurs before TT or TT occurs 
before HH , and since the coin is fair either one of these is 
equally likely. And indeed setting p = q = 1/2 in our formula 
above shows that this is the case. 
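
The same check can be carried out numerically. The sketch below evaluates the closed form (p^2 + qp^2)/(1 - pq) and compares it with a partial sum of the two geometric series; the function names and the bias 0.7 are our own illustrative choices.

```python
from fractions import Fraction

def hh_before_tt_closed(p):
    """Closed form (p^2 + q p^2) / (1 - p q) with q = 1 - p."""
    q = 1 - p
    return (p**2 + q * p**2) / (1 - p * q)

def hh_before_tt_partial(p, terms):
    """Sum the first `terms` terms of each of the two geometric series."""
    q = 1 - p
    return sum((p * q) ** k * (p**2 + q * p**2) for k in range(terms))

# Fair coin: the exact answer should be 1/2.
fair = hh_before_tt_closed(Fraction(1, 2))

# Biased coin: the partial sums should approach the closed form.
biased_exact = hh_before_tt_closed(0.7)
biased_series = hh_before_tt_partial(0.7, 200)
```

Because pq is at most 1/4, the terms of the series shrink quickly, so a modest number of terms already agrees with the closed form to many decimal places.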



Probability measure: a function P on events such that

(1) for every event A, P(A) ≥ 0

(2) P(Ω) = 1

(3) if A_1, A_2, ... is a sequence of disjoint events,
    then P(A_1 + A_2 + ...) = P(A_1) + P(A_2) + ...

Properties of probability measures:

(1) if A and B are disjoint events, then P(A + B) = P(A) + P(B)

(2) if A and B are independent, then P(AB) = P(A)P(B)

(3) if A is a subevent of B, then P(A) ≤ P(B)

(4) if A and B are any two events, then
    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

(5) if A is any event, then P(Ā) = 1 - P(A)

Probability Measures

7. Exercises for Chapter I: Sets, Events and Probability

The Algebra of Sets and Multisets

1. If A and B are sets, the stroke of A by B, written
A\B, stands for A ∩ B̄, i.e. those elements of A that are
not in B. As an event A\B stands for "A occurs but B does
not." Show that the operations of union, intersection and
complement can all be expressed using only the stroke operation.

2. The symmetric difference of A and B, written A Δ B, is
defined by

A Δ B = (A\B) ∪ (B\A).

As an event A Δ B means "either A occurs or B occurs but
not both." For this reason this operation is also called the
"exclusive or." Use a Venn diagram to illustrate this operation.

3. The set of elements where A implies B , denoted A/B , is 
the set 

A/B = Ā ∪ B.

As an event A/B stands for "if A occurs then B does also." 
Use a Venn diagram to illustrate this operation. 

4. Using Venn diagrams prove the following:

(a) (A/B) ∩ (B/C) ⊆ A/C, i.e. if A implies B and B
implies C, then A implies C.

(b) (A/B) ∩ (A/C) = A/(B ∩ C), i.e. A implies B and A
implies C if and only if A implies B and C.

(c) (A/B) ∩ (B/A) = .

5. Show that for any four sets A, B, C and D, the following
is true: (A ∪ B) \ (C ∪ D) ⊆ (A\C) ∪ (B\D).

6. Prove that any Boolean (logical) expression in
A_1, A_2, ..., A_n is a union of atoms. In what sense is this
union unique?

7. Let B be a multiset. We say that A is a sub-multiset
of B if every element of A occurs at least as many times in
B as it does in A. For example, {a,a,b} is a sub-multiset
of {a,a,b,b,c} but not of {a,b,b}. When A is a sub-multiset
of B, it makes sense to speak of the difference of
A and B, written B-A; namely define B-A to be the unique
multiset such that A + (B-A) = B. For example,
{a,a,b,b,c} - {a,a,b} = {b,c}. Suppose henceforth that A
and B are sets. When does B-A coincide with B\A? Prove
that A ∪ B = A + B - AB. Compare this with property (b) in
section 1.3.

The Bernoulli Sample Space 

8. Give an explicit expression for the event "a run of three 
heads occurs before a run of two tails" in terms of elementary 
Bernoulli events. Suggest how this might be extended to "a run 
of k heads occurs before a run of n tails." 

The Concept of Probability 

9. In a certain town there are exactly 1000 families that
have exactly three children. Records show that 11.9% have 3 boys,
36.9% have 2 boys, 38.1% have 2 girls and 13.1% have 3 girls.
Use a multiset to describe this situation. Give an interpretation
in terms of probability. What is the probability that in
one of the above families all the children have the same gender?

10. In a factory there are 100 workers. Of the total, 65 are 
male, 77 are married and 50 are both married and male. How many 
workers are female? What fraction of the female workers are 
married? Ask the same questions for male workers. 

11. Express P(A ∪ B ∪ C) in terms of P(A), P(B), P(C),
P(A ∩ B), P(A ∩ C), P(B ∩ C) and P(A ∩ B ∩ C). Note the
similarity of this expression with that of property (b) of
section 1.3.

12. Let D_1 be the event "exactly one of the events A, B and C
occurs." Express P(D_1) in terms of P(A), P(B), P(C), P(A ∩ B),
P(A ∩ C), P(B ∩ C) and P(A ∩ B ∩ C).

13*. Condition (3) for a probability measure can be stated in
several other ways. Prove that condition (3) implies each of
(a) and (b) below and that either of these implies condition (3).

(a) If A_1 ⊆ A_2 ⊆ A_3 ⊆ ... is an ascending sequence
of events and if A = A_1 ∪ A_2 ∪ A_3 ∪ ..., then

P(A) = lim_{n→∞} P(A_n).

(b) If A_1 ⊇ A_2 ⊇ A_3 ⊇ ... is a descending sequence
of events and if A = A_1 ∩ A_2 ∩ A_3 ∩ ..., then

P(A) = lim_{n→∞} P(A_n).

Independent Events

14. If A and B form a pair of independent events, show
that the pair A, B̄, the pair Ā, B and the pair Ā, B̄ are
each a pair of independent events.

15. In exercise 10, are the properties of being male and of 
being female independent? Is the property of being male inde- 
pendent of being married? 

16. The probability of throwing a "6" with a single die is 1/6. 
If three dice are thrown independently, what is the probability 
that exactly one shows a "6"? Use exercise 12. 

17. A student applies for two national scholarships. The 
probability that he is awarded the first is 1/2, while the 
probability for the second is only 1/4. But the probability 
that he gets both is 1/6. Are the events that he gets the
scholarships independent of one another? Discuss what this means.

18. A baseball player has a 0.280 batting average. What is the 
probability that he gets exactly one hit in the next three times 
at bat? See exercise 16. To do this exercise one must assume 
that the player's times at bat are independent. Is this a 
reasonable assumption? 

19. Three pigeons have been trained to peck at one of two 
buttons in response to a visual stimulus and do so correctly with 
probability p . Three pigeons are given the same stimulus. 
What is the probability that the majority peck at the correct 
stimulus? Suppose that one of the pigeons sustains an injury
and subsequently pecks at one or the other button with equal
probability. Which is more likely to be the correct response, 
the button pecked by one of the normal pigeons or the button 
pecked by a majority of the three pigeons? 

20. How many times must one roll a die in order to have a 99%
chance of rolling a "6"? (Answer: 26 times.) If you rolled 
a die this many times and it never showed a "6", what would 
you think? 

The Bernoulli Process 

21. A dice game commonly played in gambling houses is Craps. 
In this game two dice are repeatedly rolled by a player, called 
the "shooter," until either a win or a loss occurs. It is 
theoretically possible for the play to continue for any given 
number of rolls before the shooter either wins or loses. Com- 
puting the probability of a win requires the full use of 
condition (3). Because of the complexity of this game, we will 
consider here a simplified version. 

The probability of rolling a "4" using two (fair) dice is
1/12, and the probability of rolling a "7" is 1/6. What is the
probability of rolling a "4" before rolling a "7"? This
probability appears as part of a later calculation.

22. (Prendergast) Two technicians are discussing the relative
merits of two rockets. One rocket has two engines, the other
four. The engines used are all identical. To ensure success
the engines are somewhat redundant: the rocket will achieve its
mission even if half the engines fail. The first technician
argues that the four-engine rocket ought to be the better one. 

The second technician then says, "Although I cannot reveal 
the failure probability of an engine because it is classified 
top secret, I can assure you that either rocket is as likely to 
succeed as the other." 

The first technician replies, "Thank you. What you just 
told me allows me to compute the failure probability both for 
an engine and for a rocket." 

Can you do this computation also? 

23*. Let A be the event in the Bernoulli process that the
coin forever alternates between heads and tails. This event
consists of just two possible sample points:

A = {HTHTH..., THTHT...}.

Using exercise 13, prove that P(A) = 0. Does this mean that
A is impossible? More generally if an event B in the Bernoulli
process consists of only a finite number of sample points, then
P(B) = 0.

Chapter II Finite Processes



Historically the theory of probability arose from the 
analysis of games of chance, usually based on the toss of a die 
or the choice of a card from a deck. For this reason the oldest 
probability models have only a finite number of sample points. 
In this chapter we introduce the techniques for computing 
probabilities of events in a finite process. We will later see 
that these techniques are useful in many other processes as well. 

1 . The Basic Models 

The basic situation for a finite process is the following. 
Imagine that we have a population consisting of individuals . They 
may be people, cards or whatever. Now choose one individual from
the population. How many ways can we do this? The answer, of
course, is the number of individuals in the population. The
individual we chose is called a sample (of size 1) from the
population. More generally suppose that we choose not just one 
but a whole sequence of k individuals from the population. The 
sampling procedure we envision here is successive: we choose one 
individual at a time. How many ways can we do this? The answer 
will depend on whether the same individual can be chosen more 
than once, or equivalently on whether a chosen individual is 
returned to the population before another is chosen. The two 
kinds of sampling are called sampling with replacement and 
sampling without replacement . The sequence of chosen individuals 
is called a sample of size k. 



To illustrate this we consider dice and cards. A single
roll of a die samples one of its six faces. If we roll the die
k times, or equivalently if we roll k dice, we are choosing a
sample of size k with replacement from the population of six 
faces. Similarly, a choice of one card from the standard 52 
card deck of cards is a sample from a population of 52 individuals. 
However, if we deal out k cards, we are choosing a sample of size 
k without replacement from the population of cards. Note that in 
a sample the order matters, so that a sample of cards is not the 
same as a hand of cards, for which the order does not matter. 
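
The two kinds of sampling, and the difference between a sample and a hand, can be made concrete by listing them for a very small population. The sketch below uses a hypothetical population of four individuals and samples of size 2; both choices are ours.

```python
from itertools import permutations, product

population = ["A", "B", "C", "D"]  # a small illustrative population
k = 2

# Samples of size k with replacement: every sequence of k individuals.
with_replacement = list(product(population, repeat=k))

# Samples of size k without replacement: no individual is repeated.
without_replacement = list(permutations(population, k))

# A "hand" ignores order, so the samples (A, B) and (B, A) give one hand.
hands = {frozenset(sample) for sample in without_replacement}
```

With four individuals there are 16 samples of size 2 with replacement, 12 without replacement, and 6 hands, since each hand arises from two differently ordered samples.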

The description of a finite process given above is called 
the sampling model. It is by no means the only model or the 
best model for a given situation. For the rest of this section 
we consider several other models all of which are mathematically
equivalent.

The occupancy model is the following. We have k balls or 
marbles and n boxes. Denote the set of balls by B and the set of 
boxes by U (for urns). A placement is a way of placing the balls
into the boxes, each ball in some box. For example, here is a
placement of 4 balls into 5 boxes:

[Figure: a placement of 4 balls into 5 boxes labeled A, B, C, D, E]



In the distribution model we have an alphabet U whose members
are called letters. A word of length k is any sequence of k
letters from the alphabet. The distribution model is easily seen
to be equivalent to the occupancy model: the letters correspond
to the boxes and the positions of the letters in a word correspond
to the balls.

[Figure: a placement of 4 balls into 3 boxes (A, B, C) and the corresponding 4-letter word ACBC]

If we regard the alphabet U as a population whose individuals 
are letters, then it is easy to see that a word is just a sample 
with replacement. Therefore the distribution and occupancy 
models are both equivalent to the sampling model with replacement. 
In terms of the occupancy model, sampling without replacement 
means that no box has more than one ball. In terms of the distri- 
bution model, this means that no letter appears twice in a word. 

The Mathematical Model. In mathematics a placement of
balls is called a function from B to U. Sampling without
replacement corresponds to one-to-one functions.

The Physics Model. In physics the balls are called
particles, and the boxes are called states, while a placement
is called a configuration or system. For a given
placement, the occupation number of box i, called θ_i,
is the number of balls placed in box i. The occupation
numbers trivially satisfy Σ_{i=1}^{n} θ_i = k = |B|. For example,
saying that the placement corresponds to a one-to-one
function (or sampling without replacement) is the same as
saying θ_i equals 0 or 1 for all i. In physics such a
restriction is called an exclusion principle: a given state
may have at most one particle in it. The physics model we
have described here goes by the name of Maxwell-Boltzmann
statistics (with or without the exclusion principle). We
shall see other statistics in later sections, which are
more physically realistic.
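
The equivalence of the models can be traced on the word ACBC used earlier. The sketch below reads the same object as a word, as a function from balls to boxes, and as a list of occupation numbers; the variable names are ours.

```python
from collections import Counter

word = "ACBC"                 # distribution model: a 4-letter word
alphabet = ["A", "B", "C"]    # the boxes U

# Mathematical model: a function from balls (positions 1..k) to boxes.
placement = {ball: letter for ball, letter in enumerate(word, start=1)}

# Physics model: the occupation number of each box.
counts = Counter(placement.values())
occupation = {box: counts[box] for box in alphabet}

# The occupation numbers add up to the number of balls k.
k = len(word)
total = sum(occupation.values())

# Sampling without replacement, a one-to-one function and the exclusion
# principle all say the same thing: every occupation number is 0 or 1.
one_to_one = all(n <= 1 for n in occupation.values())
```

For this word the box C holds two balls, so the placement is not one-to-one and would not arise from sampling without replacement.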

At this point we can make a dictionary for translating
terms from one model to another.

Model         Terminology (terms in one column are equivalent)

Occupancy     placement       balls      B             boxes        U           at most one ball per box
Distribution  word            places                   letters      alphabet    no repeated letters
Sampling      ordered sample  position                 individuals  population  without replacement
Mathematics   function                   domain                     range       one-to-one function
Physics*      configuration   particles                states                   exclusion principle
Astrology     horoscope       planets    solar system  signs        Zodiac      horoscope

* Maxwell-Boltzmann statistics



2. Counting Rules and Stirling's Formula 

Placements in their many variations and equivalent forms 
are the most commonly encountered objects in probability 
computations. A roll of dice, a hand of cards, even a con- 
figuration of particles in chemistry or physics are all forms 
of placements. In this section we will concentrate on the 
most basic rules for counting collections of placements. In 
section four we will consider the more subtle 
kind of counting necessary in the atomic and sub-atomic do- 
mains of chemistry and physics. 



The First Rule of Counting

The most fundamental rule of counting is one so obvious
that it doesn't seem necessary to dwell on it:

First Rule of Counting. If an object is formed by making a
succession of choices such that there are

    n_1 possibilities for the first choice
    n_2 possibilities for the second choice
    etc.

Then the total number of objects that can be made by
making a set of choices is

    n_1 · n_2 · n_3 · ...

Note that by a "succession of choices" we mean that after
the first choice is made, there are n_2 choices for the
second choice, and similarly for subsequent choices.

We illustrate this rule with some familiar examples.

Throwing Dice. How many ways can we roll three dice? A
roll of three dice requires three "choices": one for each
die. Since each die has six possibilities, there are
6^3 = 216 rolls of three dice.

Notice that it does not matter whether we view the three 
dice as being rolled one at a time or all at once. We will 
call such choices independent : the rolls of the other dice 
do not affect the set of possibilities available to any 
given die. 

Dealing a Poker Hand. How many ways can we deal five cards
from a deck of 52 cards? (We consider as different two hands having
the same cards but dealt in different orders.) A deal of
five cards consists of five choices. There are 52 choices
for the first card, 51 choices for the second card, etc.
The total number of deals is then 52 · 51 · 50 · 49 · 48 = 311,875,200.

Unlike the situation for rolls of three dice, the cards
dealt in the above example are not independent choices. The
earlier cards we dealt do affect the set of possibilities for
later cards. However the earlier cards do not affect the
number of possibilities for a given later deal of a card.
Hence the first rule still applies.
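
Both examples reduce to multiplying the number of possibilities at each choice. A short Python sketch, with variable names of our own choosing, reproduces the two counts and confirms the first by direct enumeration.

```python
from itertools import product
from math import prod

# Three dice: three independent choices with six possibilities each.
rolls = prod([6, 6, 6])
enumerated = sum(1 for _ in product(range(1, 7), repeat=3))

# Ordered deal of five cards: the number of possibilities shrinks by one
# with each card dealt, but the first rule still applies.
deals = prod([52, 51, 50, 49, 48])
```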

Before we consider more complex counting problems we 
restate the above two examples in the general language of 
distribution and occupancy. 

Arbitrary Placements. The total number of ways to place k
balls into n boxes or equivalently the total number of k-letter
words made from an alphabet of n letters is n^k:
each ball or letter can be chosen in n ways independently
of the others.

Placements no two in one box. The total number of ways to 
place k balls into n boxes so that no two occupy the same 
box or equivalently the total number of k-letter words made 
from an alphabet of n letters and having no repeated letters 
is 

(n)_k = n(n-1) ... (n-k+1).

There are k factors in this product, one for each choice. 
This product is called the lower or falling factorial . 

An important special case of the second formula is
the one for which k = n. Such a placement has a special
name: it is a permutation. For example, if we deal all 52
cards from a standard deck, there are (52)_52 = 52 · 51 · 50 ··· 3 · 2 · 1
ways for this to happen. This is a very large number, and
we will discuss techniques for approximating it below.

Permutations occur so frequently in computations that
we have a special notation for the total number of them.

Definition. The total number of ways to place n balls
into n boxes, each ball in a different box, or equivalently
the number of n-letter words using all the letters from an
n-letter alphabet is called n-factorial and is written

n! = (n)_n = n(n-1) ... 3 · 2 · 1.



Arbitrary placements of k balls into n boxes      n^k

Placements no two in one box of
    k balls into n boxes                          (n)_k
    n balls into n boxes                          n! = (n)_n

Table 1: Placements
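
The entries of Table 1 are easy to compute directly. The sketch below defines the falling factorial as a product of k factors; the function name `falling_factorial` is ours, and Python's standard `math.perm` and `math.factorial` are used only as independent checks.

```python
from math import factorial, perm

def falling_factorial(n, k):
    """(n)_k = n (n-1) ... (n-k+1), a product of k factors."""
    result = 1
    for i in range(k):
        result *= n - i
    return result

arbitrary = 6 ** 3                        # n^k with n = 6, k = 3
five_card_deals = falling_factorial(52, 5)
permutations_of_5 = falling_factorial(5, 5)
too_many_balls = falling_factorial(3, 5)  # more balls than boxes
```

When k exceeds n one of the factors is zero, so the count is 0, as it should be: more balls than boxes cannot be placed with no two in one box.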



Stirling's Formula.

The computation of factorials is so common in probability
that it is a great relief to learn that there is an easy way
to compute them. The method makes use of an approximation
known as Stirling's Formula. The precise mathematical statement
is the following:

lim_{n→∞} n! / (n^n e^{-n} √(2πn)) = 1,

but in practice this is what one uses:

n! ≈ n^n e^{-n} √(2πn)

Stirling's Formula

The symbol "≈" means "approximately equal to", and in practice
in any expression involving large factorials, we replace each
factorial by the right-hand side of Stirling's Formula above.

For example, the total number of permutations of a
standard deck of cards is approximately:

52^52 e^{-52} √(104π) ≈ 8.053 × 10^67
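
The quality of the approximation for n = 52 can be examined directly, since 52! can also be computed exactly. This is a minimal Python sketch with names of our own choosing.

```python
from math import e, factorial, pi, sqrt

def stirling(n):
    """Stirling's approximation n^n e^(-n) sqrt(2 pi n)."""
    return n**n * e ** (-n) * sqrt(2 * pi * n)

approx = stirling(52)
exact = factorial(52)

# The ratio is close to 1 and tends to 1 as n grows.
ratio = approx / exact
```

For n = 52 the approximation falls short of the exact value by a fraction of one percent, which is usually accurate enough when the result is only wanted to a few significant figures.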



The Second Rule of Counting.

Poker Hands. Anyone who has played cards knows that one
normally does not care about the order in which one is
dealt a hand of cards. That is, a hand of cards is not the
same as a deal of cards. A poker hand is defined to
be a subset of five cards from a standard deck, and the
order in which the cards are obtained is immaterial. We
cannot count the number of poker hands using the first rule
of counting because they violate the fundamental premise of
that rule: the object must be obtained by successive choices.
However, every poker hand can be dealt in precisely 5! = 120
ways. Therefore the number of poker hands is

(52 · 51 · 50 · 49 · 48)/5! = 311,875,200/120 = 2,598,960.

This illustrates the second rule of counting, also called
the "shepherd principle": if it is too difficult to count
sheep, count their legs and then divide by four.

Second Rule of Counting. If we wish to count a set of
objects obtained by making choices but for which the order
of choice does not matter, count as if it did matter and
then divide by a factor: the number of ordered objects per
unordered object.
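
The poker count follows the rule step by step: count ordered deals, then divide by the number of orderings per hand. The sketch below also checks the principle by brute force on a hypothetical deck of only 6 cards with hands of 3, a size chosen by us so that everything can be listed.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

ordered_deals = perm(52, 5)            # count as if order mattered
orderings_per_hand = factorial(5)      # each hand can be dealt in 5! ways
poker_hands = ordered_deals // orderings_per_hand

# Brute-force check of the same principle on a small deck of 6 cards.
small_deals = sum(1 for _ in permutations(range(6), 3))
small_hands = sum(1 for _ in combinations(range(6), 3))
```

The division is exact because every hand corresponds to the same number of deals; this is the condition that makes the second rule applicable.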



Let us illustrate this in a more complicated situation. 

Bridge Games. How many bridge situations are possible?
By definition a bridge game consists of four people being
dealt 13 cards each. However, the order in which each of
the four hands is dealt does not matter. So we first count
as if the order did matter: this gives 52! possible deals.
But each hand can be dealt 13! ways, and there are four
hands for a total of (13!)^4 ways to deal a given bridge
game. Therefore there are

52!/(13!)^4 ≈ 5.36447 × 10^28

possible bridge situations. The symbol "≈" means
"approximately equal to."
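
Since Python integers have unlimited size, the number of bridge situations can be computed exactly rather than approximately. The variable names below are ours.

```python
from math import factorial

ordered_deals = factorial(52)              # as if the order mattered
orderings_per_game = factorial(13) ** 4    # 13! orderings for each of 4 hands
bridge_situations = ordered_deals // orderings_per_game

# The division leaves no remainder, as the second rule requires.
remainder = ordered_deals % orderings_per_game
```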

One must be careful when applying the second rule to 
make certain that the number of ordered objects is the same 
for any unordered object. In Section 4 we will give
an example of this kind of difficulty. Meanwhile you are
now equipped to perform all the (unstarred) counting computations
in the exercises.



Before we end this section we cannot resist one more
generalization suggested by the above bridge example. Using
the language of the occupancy model, a bridge game consists
of 52 balls being placed in four boxes such that every box
has exactly 13 balls. More generally, the number of ways
that k balls can be placed in n boxes so that θ_1 balls are
in the first box, θ_2 balls are in the second box, etc. is,
by the second rule of counting,

\[ \frac{k!}{\theta_1! \, \theta_2! \cdots \theta_n!} \]

Note that for the above formula to make sense we must have
θ_1 + θ_2 + ... + θ_n = k. This expression, called the
multinomial coefficient, is written

\[ \binom{k}{\theta_1, \theta_2, \ldots, \theta_n} \]

and is pronounced "k choose θ_1, θ_2, ..., θ_n." The numbers
θ_1, θ_2, ..., θ_n are called the occupation numbers of the
placement.

An important special case of the multinomial coefficient
is the case n = 2, when it is called the binomial
coefficient. This number should be a familiar concept from
the binomial expansion in algebra and calculus. Because a
placement of balls into two boxes is completely determined
by the choice of those in one of the boxes, we can also
interpret the binomial coefficient $\binom{k}{\theta_1, \theta_2}$ as the number of
θ_1-element subsets of a k-element set. The binomial coefficient
is often abbreviated to

\[ \binom{k}{\theta_1} = \binom{k}{\theta_1, \theta_2} \]

and is then pronounced simply "k choose θ_1." Using this notation we
can quickly compute the number of poker hands because a poker
hand consists of five cards chosen from 52 cards:

\[ \binom{52}{5} = \binom{52}{5, 47} = \frac{52!}{5! \, 47!} = \frac{(52)_5}{5!} \]



Furthermore, we can use binomial coefficients and the
first rule of counting to find another formula for the
multinomial coefficient. A placement of k balls into n
boxes with occupation numbers $\theta_1,\theta_2,\ldots,\theta_n$ can be made using
n-1 choices: choose $\theta_1$ balls from the k balls and put them
in the first box, choose $\theta_2$ balls from the remaining $k-\theta_1$
balls and put them in the second box, ..., choose $\theta_{n-1}$ balls
from the remaining $k-\theta_1-\cdots-\theta_{n-2}$ balls and put them in the
next to last box, and finally put the last $\theta_n = k-\theta_1-\cdots-\theta_{n-1}$
balls in the last box (no more choice is necessary). Therefore

    $$\binom{k}{\theta_1,\theta_2,\ldots,\theta_n} = \binom{k}{\theta_1}\binom{k-\theta_1}{\theta_2}\cdots\binom{k-\theta_1-\cdots-\theta_{n-2}}{\theta_{n-1}}.$$


Multinomial coefficient

    $$\binom{k}{\theta_1,\theta_2,\ldots,\theta_n} = \frac{k!}{\theta_1!\,\theta_2!\cdots\theta_n!}$$

Subsets of size $\theta$ of a set of k balls

    $$\binom{k}{\theta} = \binom{k}{\theta,\;k-\theta}$$

Relationship between multinomial and binomial coefficients

    $$\binom{k}{\theta_1,\theta_2,\ldots,\theta_n} = \binom{k}{\theta_1}\binom{k-\theta_1}{\theta_2}\cdots\binom{k-\theta_1-\cdots-\theta_{n-2}}{\theta_{n-1}}$$

Table 2: Multinomial and Binomial Coefficients
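
The two formulas for the multinomial coefficient can be compared
numerically. The following is a minimal sketch in Python, not part of
the original text; the function names `multinomial` and
`multinomial_via_binomials` are our own illustrative choices.

```python
from math import comb, factorial, prod

def multinomial(k, thetas):
    """k! / (theta_1! ... theta_n!) for occupation numbers summing to k."""
    if sum(thetas) != k:
        raise ValueError("occupation numbers must sum to k")
    return factorial(k) // prod(factorial(t) for t in thetas)

def multinomial_via_binomials(k, thetas):
    """Same count as a product of binomial coefficients, one choice per box."""
    remaining, count = k, 1
    for t in thetas[:-1]:          # the last box takes whatever is left
        count *= comb(remaining, t)
        remaining -= t
    return count

bridge_deals = multinomial(52, [13, 13, 13, 13])
poker_hands = comb(52, 5)

print(f"{bridge_deals:.5e}")       # number of bridge situations
print(poker_hands)                 # 2598960 five-card poker hands
```

Both functions agree on the bridge example, which is a quick check of
the relationship recorded in Table 2.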



3. Computing Probabilities.

Consider a finite sample space $\Omega$. The events of $\Omega$ are the
subsets $A \subseteq \Omega$. We would like to see what a probability measure
P on $\Omega$ means. Remember that P is defined on events of $\Omega$.
An event $A \subseteq \Omega$ may be written as a finite set

    $$A = \{\omega_1, \omega_2, \ldots, \omega_n\}$$

and by additivity any probability measure P must satisfy

    $$P(A) = P(\{\omega_1\}) + P(\{\omega_2\}) + \cdots + P(\{\omega_n\}).$$

We call the events $\{\omega\}$, having just one outcome $\omega \in \Omega$, the
atoms of $\Omega$. It is important to distinguish between an outcome
and an atom: the atom $\{\omega\}$ of an outcome $\omega$ is the event
"$\omega$ occurs". The distinction is roughly the same as that between
a noun and a simple sentence containing it, e.g., between "tree"
and "This is a tree."

What we have just shown above is that every probability
measure P on a finite sample space $\Omega$ is determined by its
values on atoms. The value on an arbitrary event $A \subseteq \Omega$ is
then computed by the formula:

    $$P(A) = \sum_{\omega \in A} P(\{\omega\}).$$

The values of P on the atoms may be assigned arbitrarily so
long as:

(1) For every atom $\{\omega\}$, $0 \le P(\{\omega\}) \le 1$,

(2) $\sum_{\omega \in \Omega} P(\{\omega\}) = 1$.

Whenever (1) and (2) hold, P defines a consistent probability
measure on $\Omega$.
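
As an illustration, a probability measure on a finite sample space can
be stored as a table of atom probabilities. The sketch below is ours,
not the book's; the loaded die is a made-up example and the names
`is_probability_measure` and `prob` are hypothetical.

```python
from fractions import Fraction

def is_probability_measure(atom_probs):
    """Check conditions (1) and (2) for a dict mapping outcomes to P({outcome})."""
    in_range = all(0 <= p <= 1 for p in atom_probs.values())
    return in_range and sum(atom_probs.values()) == 1

def prob(event, atom_probs):
    """P(A) as the sum of the atom probabilities over the outcomes in A."""
    return sum(atom_probs[w] for w in event)

# Hypothetical example: a loaded die whose six is three times as likely
# as each of the other faces.
loaded_die = {face: Fraction(1, 8) for face in range(1, 6)}
loaded_die[6] = Fraction(3, 8)

print(is_probability_measure(loaded_die))   # True
print(prob({2, 4, 6}, loaded_die))          # 5/8
```

Exact fractions are used so that condition (2) can be tested with
equality rather than with a floating-point tolerance.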

The simplest example of a probability measure on a finite
sample space $\Omega$ is the one we will call the equally likely
probability; it is the unique probability measure P on $\Omega$
for which every atom is assigned the same probability as any
other. Hence for every atom $\{\omega\}$, $P(\{\omega\}) = 1/|\Omega|$. For more
general events $A \subseteq \Omega$, this probability has a simple formula:

    $$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{no. of outcomes } \omega \in A}{\text{total no. of outcomes in } \Omega}.$$

The equally likely probability is quite common in gambling
situations as well as in sampling theory in statistics, although
in both cases great pains are taken to see to it that it really
is the correct model. For this reason this probability has
something of an air of artificiality about it, even though it or
probability measures close to it do occur often in nature.
Unfortunately nature seems to have a perverse way of hiding the
proper definition of $\Omega$ from casual inspection.

The phrases "completely at random" or simply "at random" 
are used to indicate that a given problem is assuming the 
equally likely probability. The latter phrase is misleading 
because every probability measure defines a concept of randomness 
for the sample space in question. Even certainty for one outcome 
is a special case of a probability measure. In Chapter VII we 
will justify the use of the description "completely random" for 
the equally likely probability measure. For now we consider some 
concrete examples of this probability measure. 

Rolling Dice. What is the probability that in a roll of three
dice no two show the same face? We already computed $|\Omega|$ in the
last section: $|\Omega| = 216$. The event A in question is "the
faces of the three dice are all different." We think of an
outcome in A as a placement of three balls into 6 boxes so that
no two balls are in the same box. There are $(6)_3 = 6\cdot5\cdot4 = 120$
placements with this property. Hence

    $$P(A) = \frac{(6)_3}{6^3} = \frac{120}{216} = .555\ldots$$
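
For a sample space this small the count can also be checked by brute
force. The following Python sketch (our own illustration, not the
book's method) lists all rolls and keeps those with three different
faces.

```python
from fractions import Fraction
from itertools import product

# Every roll of three distinguishable dice is an ordered triple of faces.
omega = list(product(range(1, 7), repeat=3))
# The event A: no two dice show the same face.
event_a = [roll for roll in omega if len(set(roll)) == 3]

p_a = Fraction(len(event_a), len(omega))
print(len(omega), len(event_a), p_a)   # 216 120 5/9
```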
Birthday Coincidences. If n students show up at random in a
classroom, what is the probability that at least two of them have
the same birthday? In order to solve this problem we will make
some simplifications. We will assume that there are only 365 days
in every year; that is, we ignore leap years. Next we will assume
that every day of the year is equally likely to be a birthday.
Both of these are innocuous simplifications. Much less innocuous
is the assumption that the students are randomly chosen with
respect to birthdays. What we mean is that the individual students'
birthdays are independent dates.

Let B be the event in question, and let $A = \overline{B}$ be the
complementary event "no two students have the same birthday."
Now just as we computed in the dice rolling problem above,

    $$P(A) = \frac{(365)_n}{365^n} \quad\text{and hence}\quad P(B) = 1 - \frac{(365)_n}{365^n}.$$

These probabilities are easily computed on a hand calculator.
Here are some values:

    n = 20,   P(A) = 0.5886,   P(B) = 0.4114
    n = 22,   P(A) = 0.5243,   P(B) = 0.4757
    n = 25,   P(A) = 0.4313,   P(B) = 0.5687
    n = 30,   P(A) = 0.2937,   P(B) = 0.7063

So in a class of 30 students the odds are 7 to 3 in favor of at 
least two having a common birthday. 
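
The table of birthday probabilities is easy to reproduce. Here is a
short Python sketch under the same simplifying assumptions (365 equally
likely, independent birthdays); the function name
`p_no_shared_birthday` is ours.

```python
from math import perm

def p_no_shared_birthday(n, days=365):
    """P(A) = (days)_n / days**n: all n birthdays are different."""
    return perm(days, n) / days**n

for n in (20, 22, 25, 30):
    p_a = p_no_shared_birthday(n)
    print(f"n = {n}: P(A) = {p_a:.4f}, P(B) = {1 - p_a:.4f}")
```

Here `math.perm(365, n)` is the lower factorial $(365)_n$, computed
exactly as an integer before the division.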

Random Committees. In the U.S. Senate a committee of 50 senators
is chosen at random. What is the probability that Massachusetts
is represented? What is the probability that every state is
represented?

In any real committee the members are not chosen at random.
What this question is implicitly asking is whether random choice
is "fair" with respect to two criteria of fairness. Note that
the phrase "at random" is ambiguous. A more precise statement
would be that every 50-senator committee is as probable as any
other.

We first count $|\Omega|$, the number of possible committees.
Since 50 senators are being chosen from 100 senators, there are

    $$|\Omega| = \binom{100}{50}$$

committees. Let A be the event "Massachusetts is not represented."
The committees in A consist of 50 senators chosen from the 98
non-Massachusetts senators. Therefore,

    $$|A| = \binom{98}{50}.$$

Hence

    $$P(A) = \binom{98}{50}\Big/\binom{100}{50} = \frac{(98)_{50}/50!}{(100)_{50}/50!} = \frac{(98)_{50}}{(100)_{50}} = \frac{98\cdot97\cdots49}{100\cdot99\cdot98\cdots51} = \frac{50\cdot49}{100\cdot99} = 0.247.$$

So the answer to our original question is that Massachusetts is
represented with probability 1 - 0.247 = 0.753 or 3 to 1 odds.
This seems quite fair.

Now consider the event A = "every state is represented."
Each committee in A is formed by making 50 independent choices
from the 50 state delegations. Hence $|A| = 2^{50}$ and so

    $$P(A) = 2^{50}\Big/\binom{100}{50} \approx 10^{-14},$$

i.e., essentially impossible. By this criterion random choice
is not a fair way to choose a committee.

We computed the above probability by using Stirling's
Formula as follows:

    $$\begin{aligned}
    \frac{2^{50}}{\binom{100}{50}} &= \frac{2^{50}}{\left(\frac{100!}{50!\,50!}\right)} = \frac{2^{50}\,(50!)^2}{100!} \\
    &\approx \frac{2^{50}\left(50^{50}\,e^{-50}\sqrt{2\pi\cdot 50}\right)^2}{100^{100}\,e^{-100}\sqrt{2\pi\cdot 100}}
     = \frac{2^{50}\cdot 50^{100}\cdot e^{-100}\cdot 2\pi\cdot 50}{100^{100}\cdot e^{-100}\cdot\sqrt{2\pi}\cdot 10} \\
    &= 2^{50}\cdot\left(\frac{50}{100}\right)^{100}\cdot 5\sqrt{2\pi}
     = 2^{50}\cdot\left(\tfrac{1}{2}\right)^{100}\cdot 5\sqrt{2\pi}
     = 2^{-50}\cdot 5\sqrt{2\pi} \approx 10^{-15}\cdot 10 = 10^{-14}.
    \end{aligned}$$

The last approximation above is quite rough. We used only that
$2^{10} \approx 1000$ and that $\pi$ is about 3. All we required was
the order of magnitude of the answer. Using a calculator one
gets the more exact answer:

    $$1.113 \times 10^{-14}$$

Compare this with the following answer obtained (with much more
effort) without using Stirling's Formula and correct to 5
decimal places:

    $$1.11595 \times 10^{-14}$$
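
With exact integer arithmetic available, both committee probabilities
and the Stirling estimate can be recomputed directly. This is a sketch
of ours in Python; `stirling` is an illustrative helper implementing
only the leading term $n^n e^{-n}\sqrt{2\pi n}$.

```python
from math import comb, e, pi, sqrt

def stirling(n):
    """Leading term of Stirling's Formula: n**n * e**(-n) * sqrt(2*pi*n)."""
    return n**n * e**(-n) * sqrt(2 * pi * n)

total = comb(100, 50)                      # all 50-senator committees
p_mass_absent = comb(98, 50) / total       # Massachusetts not represented
p_every_state = 2**50 / total              # one senator from every state
p_every_state_stirling = 2**50 * stirling(50)**2 / stirling(100)

print(f"{p_mass_absent:.3f}")              # 0.247
print(f"{p_every_state:.5e}")              # exact binomial coefficient
print(f"{p_every_state_stirling:.3e}")     # Stirling's Formula
```

The two last lines differ by about a quarter of one percent, which is
the size of the error one expects from Stirling's Formula at n = 50.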
4*. Indistinguishability.

If we roll three dice, the number of possible outcomes is the
number of placements of three balls in six boxes: 216. Suppose
now that the balls are photons and the boxes are six possible
states. Now how many placements are there? The answer is rather
surprising: only 56. If we consider electrons instead of photons
the answer is even smaller: 20. Moreover, if the six possible
states have the same energy, then the 56 states for photons are
equally likely. The fact that subatomic particles do not behave
as tiny hard balls is one of the major discoveries of physics
in this century. The counting problems one encounters in physics
are more difficult than those of the previous sections, but a deep
understanding of the physics of subatomic particles requires the
concepts we present here.

The reason that photons do not behave as dice or balls is a
consequence of a property known as indistinguishability. In other
words, if two photons are interchanged but the rest of the
configuration is left unchanged, then the new configuration is
identical to the old one. Moreover, given a set of various
possible different configurations, all having the same energy,
each is as likely to occur as any other. For simplicity, suppose
that we have two photons and three states. There are 6 possible
configurations:

    number of photons   in state #1   in state #2   in state #3

                             2             0             0
                             1             1             0
                             1             0             1
                             0             2             0
                             0             1             1
                             0             0             2



Particles which are indistinguishable and for which any number
of particles can occupy a given state are called bosons, and we
say they obey Bose-Einstein statistics. For example, photons
and hydrogen atoms are bosons.

Electrons differ from photons in that two electrons cannot
have the same state. As a result, if our configuration consists
of two electrons occupying three states, there are only three
possible configurations. The fact that two electrons cannot
occupy the same state is called the Pauli exclusion principle.
Particles which are indistinguishable and which obey the Pauli
exclusion principle are called fermions, and we say they obey
Fermi-Dirac statistics. For example, electrons, neutrons and
protons are fermions.
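
The counts 216, 56 and 20 quoted at the start of this section can be
confirmed by listing the configurations. The sketch below is our own
illustration in Python; it models distinguishable balls as ordered
tuples, bosons as multisets of states, and fermions as subsets of
states.

```python
from itertools import combinations, combinations_with_replacement, product

def count_configurations(particles, states):
    """Return (distinguishable, boson, fermion) configuration counts."""
    labels = range(1, states + 1)
    distinguishable = len(list(product(labels, repeat=particles)))
    bosons = len(list(combinations_with_replacement(labels, particles)))
    fermions = len(list(combinations(labels, particles)))
    return distinguishable, bosons, fermions

print(count_configurations(3, 6))   # (216, 56, 20)
print(count_configurations(2, 3))   # (9, 6, 3)
```

The second line reproduces the six photon configurations tabulated
above and the three electron configurations.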



Fermi-Dirac Statistics: Subsets.

We first count Fermi-Dirac configurations. A Fermi-Dirac
configuration means simply that certain of the states are occupied
(or "filled") and the rest are not. Thus a Fermi-Dirac configuration
is a subset of the states. Since there are n states and k
particles, there are $\binom{n}{k}$ possible configurations.

Bose-Einstein Statistics: Multisets.

In order to count placements obeying Bose-Einstein statistics,
we will use the second rule of counting. However the ordered
object corresponding to these placements is a new concept: the
disposition. A disposition of k balls in n boxes is a
placement together with the additional information of an
arrangement in some order of the balls placed in each box. Another model
for a disposition is a set of k flags arranged on n flagpoles.

[Figure: Two different dispositions of three flags on two flagpoles.]

Yet another model is that of a disposition of k cars in n
traffic lanes on a turnpike.

We can count the number of dispositions using the first
rule of counting. The first ball can be placed n ways. The
second ball has n+1 choices: either we place it in an
unoccupied box (n-1 choices) or we place it before or after the
first ball in the occupied box (2 choices). The third ball has
n+2 choices. If there are n-2 unoccupied boxes, we can place
the third ball in two ways into each of the two occupied boxes.
If there are n-1 unoccupied boxes, we can place the third ball
in three ways into the occupied box. In general each
newly-placed ball creates one more "box" for the next ball.

[Figure: the choices available to the first ball, the second ball
and the third ball.]

By the first rule of counting, there are $n(n+1)\cdots(n+k-1)$
dispositions of k balls in n boxes. We call this the rising
factorial or $n^{(k)}$. As with the lower factorial, the k denotes
the number of factors.
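
The ball-by-ball argument translates directly into a procedure that
generates every disposition. The following Python sketch is ours (the
names `dispositions` and `rising_factorial` are illustrative); each box
is represented as an ordered tuple of ball labels.

```python
def dispositions(k, n):
    """All dispositions of balls 1..k in n boxes, each box an ordered tuple."""
    current = [tuple(() for _ in range(n))]
    for ball in range(1, k + 1):
        extended = []
        for disp in current:
            for i, box in enumerate(disp):
                # The new ball may go into any gap of box i, including
                # before the first ball and after the last one.
                for pos in range(len(box) + 1):
                    new_box = box[:pos] + (ball,) + box[pos:]
                    extended.append(disp[:i] + (new_box,) + disp[i + 1:])
        current = extended
    return current

def rising_factorial(n, k):
    """n (n+1) ... (n+k-1), a product of k factors."""
    result = 1
    for j in range(k):
        result *= n + j
    return result

print(len(dispositions(3, 2)), rising_factorial(2, 3))   # 24 24
```

Every step offers one more gap than the step before, which is exactly
why the count is a rising factorial.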

We now consider an alternative way to count dispositions.
We first specify the occupancy numbers, then count all dispositions
for the given set of occupancy numbers. To see this better, we
consider the example of three balls into 2 boxes. There are four
choices for the occupancy numbers:

    $\theta_1$   $\theta_2$
     3     0
     2     1
     1     2
     0     3

    Possible occupancy numbers for placing 3 balls into 2 boxes

We now enumerate the dispositions for each set of occupancy
numbers:

    $\theta_1=3,\ \theta_2=0$    $\theta_1=2,\ \theta_2=1$    $\theta_1=1,\ \theta_2=2$    $\theta_1=0,\ \theta_2=3$

    [123] [   ]      [12] [3]         [1] [23]         [   ] [123]
    [213] [   ]      [21] [3]         [2] [13]         [   ] [213]
    [132] [   ]      [13] [2]         [1] [32]         [   ] [132]
    [312] [   ]      [31] [2]         [3] [12]         [   ] [312]
    [231] [   ]      [23] [1]         [2] [31]         [   ] [231]
    [321] [   ]      [32] [1]         [3] [21]         [   ] [321]

    The dispositions of 3 balls into 2 boxes



A glance at the table reveals the following general fact: the
number of dispositions of k balls into n boxes having a given
set of occupation numbers is precisely $k! = (k)_k$, the number
of permutations of k balls.

By the second rule of counting, the number of sets of
occupation numbers is $n^{(k)}/k!$. Now a Bose-Einstein configuration
means that each state has a certain number of particles in it
(possibly none). Thus a Bose-Einstein configuration is nothing
but a set of occupation numbers. We will write $\left\langle{n \atop k}\right\rangle$ for $n^{(k)}/k!$
by analogy with the binomial coefficient and call it the
multiset coefficient. We use this name because we may interpret
a Bose-Einstein configuration as being a multiset: a set of
elements together with a nonnegative multiplicity for each element.
For example, {a, c, a, a, c, f, g} is a multiset and not a set. The
classical terminology for multiset is combination.

Monomials. Consider for example how many monomials of degree
16 can be made with 10 variables $x_1,\ldots,x_{10}$. Each such monomial
is a product of 16 $x_i$'s where some must be repeated and the order
does not matter. Thus each is a 16-multisubset of the set
$\{x_1,\ldots,x_{10}\}$. Therefore there are

    $$\left\langle{10 \atop 16}\right\rangle = \frac{10^{(16)}}{16!} = 2{,}042{,}975$$

monomials.
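
The multiset coefficient is as easy to compute as the binomial
coefficient. In the Python sketch below (ours, with the hypothetical
name `multiset_coefficient`) the formula $n^{(k)}/k!$ is compared
against a direct listing of multisubsets for a small case, and then
applied to the monomial count.

```python
from itertools import combinations_with_replacement
from math import factorial

def multiset_coefficient(n, k):
    """Number of k-multisubsets of an n-set, computed as n^(k) / k!."""
    rising = 1
    for j in range(k):
        rising *= n + j
    return rising // factorial(k)

# Small sanity check by listing every 3-multisubset of a 5-set.
listed = len(list(combinations_with_replacement(range(5), 3)))
print(listed, multiset_coefficient(5, 3))   # 35 35

print(multiset_coefficient(10, 16))         # 2042975 monomials
```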



Particles. What is the probability that a random configuration
of k particles in n states will have occupation numbers
$\theta_1,\ldots,\theta_n$? The answer will depend on the "physics" of the problem,
i.e., which statistics is to be used: Maxwell-Boltzmann,
Fermi-Dirac or Bose-Einstein. Let A be this event.

(1) Maxwell-Boltzmann. $P(A) = \binom{k}{\theta_1,\ldots,\theta_n}\, n^{-k}$

(2) Fermi-Dirac. $P(A) = 0$ if one of the $\theta_i$ is greater than 1,
    and $P(A) = \binom{n}{k}^{-1}$ otherwise

(3) Bose-Einstein. $P(A) = \left\langle{n \atop k}\right\rangle^{-1}$

We try two examples: 3 particles in 5 states with occupation
numbers (1,0,1,1,0) and (0,3,0,0,0).

                              (1,0,1,1,0)        (0,3,0,0,0)

    (1) Maxwell-Boltzmann     6/125 = 0.048      1/125 = 0.008
    (2) Fermi-Dirac           1/10  = 0.100      0
    (3) Bose-Einstein         1/35  = 0.029      1/35  = 0.029

Notice how Bose-Einstein statistics "enhances" the probability
of having multiple particles in a single state relative to
Maxwell-Boltzmann statistics.
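
The three formulas can be evaluated exactly for the two examples. This
Python sketch is our own; it assumes the standard closed form
$\binom{n+k-1}{k}$ for the multiset coefficient (identity 11 of the
next section), and the function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial, prod

def maxwell_boltzmann(thetas):
    """Multinomial coefficient times n**(-k)."""
    k, n = sum(thetas), len(thetas)
    placements = factorial(k) // prod(factorial(t) for t in thetas)
    return Fraction(placements, n**k)

def fermi_dirac(thetas):
    """Zero if any state is doubly occupied, else 1 / C(n, k)."""
    k, n = sum(thetas), len(thetas)
    if any(t > 1 for t in thetas):
        return Fraction(0)
    return Fraction(1, comb(n, k))

def bose_einstein(thetas):
    """One over the multiset coefficient, written as C(n+k-1, k)."""
    k, n = sum(thetas), len(thetas)
    return Fraction(1, comb(n + k - 1, k))

for thetas in [(1, 0, 1, 1, 0), (0, 3, 0, 0, 0)]:
    print(thetas, maxwell_boltzmann(thetas), fermi_dirac(thetas),
          bose_einstein(thetas))
```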



Arbitrary placements

    $$\left\langle{n \atop k}\right\rangle = \frac{n^{(k)}}{k!}$$

Placements with at most one ball per box

    $$\binom{n}{k} = \frac{(n)_k}{k!}$$

Table 3: Placements of k indistinguishable balls
into n boxes.



5*. Identities for Binomial and Multiset Coefficients.

The coefficients $\binom{n}{k}$ and $\left\langle{n \atop k}\right\rangle$ satisfy a wealth of
identities. Although these can be proved using the formulas,
they can often be given combinatorial proofs. That is, we
can prove them using only their definitions in terms of "balls
into boxes."

1. $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$. To prove this we "mark" one of
the boxes, say the last one. Let:

    $\Omega$ = the set of all k-subsets of the n-set U

    B = the set of all k-subsets of the n-set U which contain
        the last box

    C = the set of all k-subsets of the n-set U which do not
        contain the last box.

$\Omega$ consists of all (unordered) k element samples from a
population U of size n. B and C are the events
"the sample has the last individual" and "the sample does not have
the last individual." As events it is clear that $\Omega = B \cup C$
and $B \cap C = \emptyset$. Therefore $|\Omega| = |B| + |C|$. Each of these is easy
to count:

    $|\Omega| = \binom{n}{k}$ by the definition of the binomial coefficient.

    $|B| = \binom{n-1}{k-1}$ since each subset in B consists of the last
    box together with an arbitrary k-1 element subset of
    the other n-1 boxes.

    $|C| = \binom{n-1}{k}$ since each subset in C consists of a k element
    subset of the other n-1 boxes.

Remember that $\Omega$, B and C are all sets of sets. So
$B \cap C = \emptyset$ means that none of the sets in B are also in C.
It does not mean that the sets in B are disjoint from those
in C.

Take for example n = 4, k = 3. The set of boxes is
U = {1,2,3,4}. The events $\Omega$, B and C are:

    $\Omega$ = {{1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}}
    B = {{1,2,4}, {1,3,4}, {2,3,4}}
    C = {{1,2,3}}

Then $|\Omega| = \binom{4}{3} = 4$, $|B| = \binom{3}{2} = 3$, $|C| = \binom{3}{3} = 1$. Be careful
to distinguish {2,3,4} from {{2,3,4}}, as for example the
former has three elements but the latter has only one. Also be
careful to distinguish subset of (as in "subset of U") from subset
in (as in "subset in B").
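
The decomposition used in this proof can be carried out mechanically
for any small n and k. The Python sketch below is ours (the name
`split_by_last_box` is hypothetical); each k-subset is a frozenset so
that the events are genuinely sets of sets.

```python
from itertools import combinations
from math import comb

def split_by_last_box(n, k):
    """Return (Omega, B, C): all k-subsets of {1..n}, those containing
    box n, and those not containing it."""
    omega = {frozenset(s) for s in combinations(range(1, n + 1), k)}
    b = {s for s in omega if n in s}
    c = {s for s in omega if n not in s}
    return omega, b, c

omega, b, c = split_by_last_box(4, 3)
print(len(omega), len(b), len(c))              # 4 3 1
print(b | c == omega, b & c == set())          # True True
print(comb(4, 3) == comb(3, 2) + comb(3, 3))   # True
```

Note that `b & c` is empty even though the subsets in `b` overlap the
subset in `c` as sets of boxes, which is the distinction drawn above.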

2. $\left\langle{n \atop k}\right\rangle = \left\langle{n \atop k-1}\right\rangle + \left\langle{n-1 \atop k}\right\rangle$. This is the multiset analogue of
identity 1. We prove it similarly. $\left\langle{n \atop k-1}\right\rangle$ is the number of
k-multisubsets of the n boxes which contain at least one copy of
the last box: every such multisubset is obtainable by choosing
an arbitrary (k-1)-multisubset of the n boxes and then
throwing in one more copy of the last box. $\left\langle{n-1 \atop k}\right\rangle$ is the number
of k-multisubsets of the n boxes which contain no copy of the
last box.

Identity 1 gives rise to Pascal's triangle. Because of
identity 2, multiset coefficients may also be arrayed in a
Pascal-like triangle.

3. $\binom{n+1}{k+1} = \binom{n}{k} + \binom{n-1}{k} + \cdots + \binom{k}{k}$. We prove this by classifying
the (k+1)-subsets of a set U of n+1 boxes according to which
is the lowest numbered box in the subset. We assume that U
consists of boxes numbered 1,2,...,n+1. Let

    A = the set of all (k+1)-subsets of U

    $B_1$ = the set of all (k+1)-subsets of U which contain 1

    $B_2$ = the set of all (k+1)-subsets of U which contain 2
          but not 1

    ...

    $B_\ell$ = the set of all (k+1)-subsets of U which contain $\ell$
          but not $1,2,\ldots,\ell-1$

Then $A = B_1 \cup B_2 \cup \cdots$ and $B_i \cap B_j = \emptyset$ (if $i \ne j$). Therefore
$|A| = |B_1| + |B_2| + \cdots$. Each of these is easy to count:
$|A| = \binom{n+1}{k+1}$; $|B_1| = \binom{n}{k}$ since every element of $B_1$ consists of
box 1 together with a k-subset of the boxes numbered 2,3,...,n+1;
$|B_2| = \binom{n-1}{k}$; and so on.

4. $\left\langle{n+1 \atop k}\right\rangle = \left\langle{n \atop 0}\right\rangle + \left\langle{n \atop 1}\right\rangle + \cdots + \left\langle{n \atop k}\right\rangle$. This is the multiset analogue
of identity 3. We classify the k-multisubsets of an (n+1)-set
according to how many copies of the last element of the (n+1)-set
are in the multisubset.



5. $\binom{n}{k} = \frac{n}{k}\binom{n-1}{k-1}$. This one is easiest to prove directly
from the formula; however it does have a combinatorial proof.
Suppose we make a table of all k-subsets of an n-set U. For
example, take n = 4, k = 3:

    1, 2, 3
    1, 2, 4
    1, 3, 4
    2, 3, 4

We use $k\binom{n}{k}$ entries when we make this table. We now count the
number of entries in another way. Each element of the n-set U
appears $\binom{n-1}{k-1}$ times in this table: once for every way of
choosing k-1 elements from the remaining n-1 elements of U.
Therefore there are also $n\binom{n-1}{k-1}$ entries in the table. Hence:

    $$k\binom{n}{k} = n\binom{n-1}{k-1}.$$

6. $\left\langle{n \atop k}\right\rangle = \frac{n}{k}\left\langle{n+1 \atop k-1}\right\rangle$. We leave the multiset analogue as an exercise.

7. $\binom{n}{k} = \binom{n}{n-k}$. Choosing a k-subset is the same as choosing its
complement, which has (n-k) elements.

8. $\binom{i+j}{k} = \sum_{\ell=0}^{k} \binom{i}{\ell}\binom{j}{k-\ell}$.

9. $\left\langle{i+j \atop k}\right\rangle = \sum_{\ell=0}^{k} \left\langle{i \atop \ell}\right\rangle\left\langle{j \atop k-\ell}\right\rangle$.

Both of these are proved the same way. Start with a set U of
n = i+j boxes. Classify every k-element subset according
to the number $\ell$ of elements among the first
i boxes, in which case $k-\ell$ of the elements will be among the
last j boxes. Similarly for multisubsets.

10. $\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{n} = 2^n$. The left hand side counts the number
of subsets of an n-set. The right
hand side is the number of placements of n distinguishable balls
into two boxes, i.e., also counts the number of subsets of an n-set.
There is no multiset analogue.



11. $\left\langle{n \atop k}\right\rangle = (-1)^k\binom{-n}{k} = \binom{n+k-1}{k}$. The binomial and multiset
coefficients make sense with n
any real number. This identity is easily proved explicitly. A
combinatorial proof reveals an alternative way to compute the
multiset coefficient. The binomial coefficient $\binom{n+k-1}{k}$ counts
the number of k-subsets of a set of n+k-1 boxes. For a given
k-subset of the boxes, change the boxes in the k-subset to balls
and change the remaining n-1 boxes into vertical lines marking
the boundaries of n new boxes. For example:

    boxes:     1   2   3   4   5     the 3-subset consisting of 2, 3, and 5

    becomes:   |   O   O   |   O     3 balls and 2 boundary lines

and thence 3 (indistinguishable) balls in 3 boxes.

Therefore $\binom{n+k-1}{k} = \left\langle{n \atop k}\right\rangle$.

12. $\left\langle{n \atop k}\right\rangle = \left\langle{k+1 \atop n-1}\right\rangle$. Use the above representation of $\left\langle{n \atop k}\right\rangle$, then
interchange balls and vertical lines.
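
Several of these identities can be spot-checked numerically, and the
balls-and-lines correspondence of identity 11 can be written out as a
small procedure. The following Python sketch is ours; all function
names are illustrative, and the checks cover only small values.

```python
from fractions import Fraction
from math import comb, factorial

def multiset(n, k):
    """Multiset coefficient via identity 11, valid for n >= 1."""
    return comb(n + k - 1, k)

def generalized_binomial(x, k):
    """x (x-1) ... (x-k+1) / k!, which makes sense for negative x too."""
    numerator = 1
    for j in range(k):
        numerator *= x - j
    return Fraction(numerator, factorial(k))

def subset_to_occupation_numbers(subset, n, k):
    """Chosen positions of 1..n+k-1 become balls; the other n-1 positions
    become boundary lines between n boxes."""
    counts, current = [], 0
    for position in range(1, n + k):
        if position in subset:
            current += 1
        else:
            counts.append(current)
            current = 0
    counts.append(current)
    return counts

def check_identities(limit=7):
    for n in range(1, limit):
        assert sum(comb(n, k) for k in range(n + 1)) == 2**n               # 10
        for k in range(1, limit):
            assert comb(n + 1, k + 1) == sum(comb(m, k) for m in range(k, n + 1))  # 3
            assert multiset(n + 1, k) == sum(multiset(n, i) for i in range(k + 1))  # 4
            assert k * comb(n, k) == n * comb(n - 1, k - 1)                # 5
            assert multiset(n, k) == (-1)**k * generalized_binomial(-n, k)  # 11
            assert multiset(n, k) == multiset(k + 1, n - 1)                # 12
            if k <= n:
                assert comb(n, k) == comb(n, n - k)                        # 7
    for i in range(1, limit):
        for j in range(1, limit):
            for k in range(limit):
                assert comb(i + j, k) == sum(
                    comb(i, l) * comb(j, k - l) for l in range(k + 1))     # 8
                assert multiset(i + j, k) == sum(
                    multiset(i, l) * multiset(j, k - l) for l in range(k + 1))  # 9
    return True

print(check_identities())                                   # True
print(subset_to_occupation_numbers({2, 3, 5}, n=3, k=3))    # [0, 2, 1]
```

A numerical check of this kind is no substitute for the combinatorial
proofs, but it catches a misremembered index quickly.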



6. Random Integers

It is intuitively obvious that if we choose an integer
"at random" it will be even with probability 1/2, divisible
by 3 with probability 1/3 and so on. Furthermore, if p and q
are different prime numbers, then an integer chosen at random
will be divisible by both p and q with probability 1/pq, since
it will be so if and only if it is divisible by the product pq.
Therefore divisibility by p and q are independent events. All
this sounds reasonable except that it is not clear how to make
sense out of the concept of a "randomly chosen integer" in
such a way that every integer is equally likely.

The naive approach is the notion of arithmetic density.
Namely let $\Omega = \{1,2,3,\ldots\}$ be the sample space consisting of
all positive integers, and we take the events to be arbitrary subsets
of $\Omega$. For an event $A \subseteq \Omega$, the arithmetic density of A is

    $$d(A) = \lim_{n\to\infty} \frac{|\{1,2,\ldots,n\} \cap A|}{n}$$

if it exists. For example if $D_p$ is the event "n is divisible
by p" or equivalently as a set, $D_p = \{p, 2p, 3p, \ldots\}$, then it
is obvious that $d(D_p) = \frac{1}{p}$. Moreover if p and q are different
primes then $d(D_p \cap D_q) = \frac{1}{pq}$. Unfortunately there are several
problems with this definition. First of all, d is not
defined on all events. Secondly, d is not a probability measure
even where it is defined. For example if we decompose the
set of even integers $D_2$ into its individual elements

    $$D_2 = \{2\} \cup \{4\} \cup \{6\} \cup \cdots$$

we get $d(D_2) = 1/2$ but $d(\{2\}) + d(\{4\}) + \cdots = 0$.

As an example of an event on which d is not defined
consider the event "the first digit of n is 1". Call this
event $F_1$. Then

    $$F_1 = \{1, 10, 11, \ldots, 19, 100, 101, \ldots, 199, 1000, \ldots\}.$$

In this case $\frac{|\{1,\ldots,n\} \cap F_1|}{n}$ forever wanders between $\frac{1}{9}$ and $\frac{5}{9}$
and never "settles down" to any limiting value as $n \to \infty$.
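
The oscillation is easy to observe by direct counting. In the Python
sketch below (ours; `fraction_leading_one` is an illustrative name) the
ratio is evaluated at values of n just before the first digit changes.

```python
def fraction_leading_one(n):
    """|{1,...,n} intersect F_1| / n, with F_1 = integers whose first digit is 1."""
    hits = sum(1 for m in range(1, n + 1) if str(m)[0] == "1")
    return hits / n

for n in (999, 1999, 9999, 19999, 99999, 199999):
    print(n, round(fraction_leading_one(n), 4))
```

At n = 999, 9999, ... the ratio is exactly 1/9, while at n = 1999,
19999, ... it climbs back to nearly 5/9, so no limit can exist.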

We shall describe a "better" approach to this problem
which is nevertheless far from being a complete answer.
Recall from calculus that the series $\sum_{n=1}^{\infty} \frac{1}{n^s}$ converges
when $s > 1$ and diverges when $s \le 1$. (The usual way one shows
this is the integral test.) The value of this series when
$s > 1$ is written $\zeta(s)$ and is a famous function called the
Riemann zeta function. Computing values of this function
is quite difficult, for example $\zeta(2) = \frac{\pi^2}{6}$. However, we
shall only need the fact that $\zeta(s)$ exists.

We now define for an event $A \subseteq \Omega$, the Dirichlet density
of A with parameter s to be

    $$P_s(A) = \sum_{n \in A} \frac{1}{n^s}\cdot\frac{1}{\zeta(s)}.$$

The Dirichlet densities are easily checked to be probability
measures. For example

    $$P_s(\Omega) = \sum_{n \in \Omega} \frac{1}{n^s}\cdot\frac{1}{\zeta(s)} = \sum_{n=1}^{\infty} \frac{1}{n^s}\cdot\frac{1}{\zeta(s)} = \zeta(s)\cdot\frac{1}{\zeta(s)} = 1.$$

Let us compute the probability of the event $D_p$,
"n is divisible by p":

    $$P_s(D_p) = \sum_{n \in D_p} \frac{1}{n^s}\cdot\frac{1}{\zeta(s)}
      = \left(\frac{1}{p^s} + \frac{1}{(2p)^s} + \cdots\right)\cdot\frac{1}{\zeta(s)}
      = \frac{1}{p^s}\left(\frac{1}{1^s} + \frac{1}{2^s} + \cdots\right)\cdot\frac{1}{\zeta(s)}
      = \frac{1}{p^s}\cdot\zeta(s)\cdot\frac{1}{\zeta(s)} = \frac{1}{p^s}.$$

Similarly, for distinct primes p and q,
$P_s(D_p \cap D_q) = \frac{1}{(pq)^s} = P_s(D_p)\,P_s(D_q)$. Therefore while the Dirichlet density
of $D_p$ is not the intuitively expected value $\frac{1}{p}$, it is
nevertheless true that being divisible by different primes
are independent events.
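
For s = 2, where $\zeta(2) = \pi^2/6$ is known in closed form, the
Dirichlet density can be approximated by truncating the series. The
Python sketch below is ours; the cutoff of 200,000 terms and the name
`dirichlet_density_s2` are arbitrary illustrative choices, and the
results are approximations only.

```python
from math import pi

def dirichlet_density_s2(indicator, terms=200_000):
    """Truncated Dirichlet density P_s(A) for s = 2, using zeta(2) = pi**2 / 6.
    `indicator(n)` should return True exactly when n belongs to A."""
    zeta2 = pi**2 / 6
    partial = sum(1 / n**2 for n in range(1, terms + 1) if indicator(n))
    return partial / zeta2

p2 = dirichlet_density_s2(lambda n: n % 2 == 0)    # close to 1/4
p3 = dirichlet_density_s2(lambda n: n % 3 == 0)    # close to 1/9
p6 = dirichlet_density_s2(lambda n: n % 6 == 0)    # close to 1/36

print(round(p2, 4), round(p3, 4), round(p6, 4), round(p2 * p3, 4))
```

The last two printed numbers agree to the displayed precision, in line
with the independence of $D_2$ and $D_3$.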

We find ourselves in a quandary. The notion of arithmetic
density is intuitive, but it is not a probability.
On the other hand the Dirichlet densities are probabilities,
but they depend on a parameter s whose meaning is not easy
to explain. Moreover the Dirichlet densities assign the
events $D_p$ the "wrong" probability.

We get out of this quandary by a simple expedient: take
the limit $\lim_{s \to 1} P_s(A)$. One can prove, although it is very
difficult to do so, that if the event A has an arithmetic
density, then $d(A) = \lim_{s \to 1} P_s(A)$. Moreover events such as $F_1$
now have a density, for one can show that $\lim_{s \to 1} P_s(F_1) = \log_{10} 2$.
These probabilities have many useful applications in the
theory of numbers.



7. Exercises for Chapter II: Finite Processes

The Basic Models

1. Flip a coin three times. How many ways can this be done?
List them. Convert each to the corresponding placement of 3
balls into 2 boxes. Do any of these placements satisfy the
exclusion principle?

2. Roll a die twice. List in a column the ways that this can
be done. In the next column list the corresponding placements
of 2 balls into 6 boxes. Mark the ones which satisfy the
exclusion principle. In a third column list the corresponding
2-letter words, using the alphabet {A, B, C, D, E, F}.

3. You are interviewing families in a certain district. In
order to ascertain the opinion held by a given family you sample
two persons from the family. Recognizing that the order in which
the two persons from one family are interviewed matters, how many
ways can one sample two persons from a six person family? List
the ways and compare with the lists in exercise 2 above. If
the two persons are interviewed simultaneously so that order no
longer matters, how many ways can one sample two persons from
a 6-person family?

4. Return to exercise 2. In a fourth column list the occupation
numbers of the six boxes.

The Rules of Counting and Stirling's Formula

5. A small college has a soccer team that plays eight games
during its season. In how many ways can the team end its
season with five wins, two losses and one tie? Use a
multinomial coefficient.

6. Ten students are travelling home from college in Los Angeles
to their homes in New York City. Among them they have two cars,
each of which will hold six passengers. How many ways can they
distribute themselves in the two cars?

The following two problems require a hand calculator.

7. Compute the order of magnitude of 1000!, i.e., compute the
integer n for which 1000! is approximately equal to $10^n$.
[Use a hand calculator and Stirling's Formula to compute the
approximate value of $\log_{10}(1000!)$.]

8. How many ways can a 100-member senate be selected from a
country having 300,000,000 inhabitants?

The Finite Uniform Probability Measure 

9. Have the students in your probability class call out their 
birthdays until someone realizes there is a match. Record how 
many birthdays were called out. We will return to this problem 
in Exercise III. %5. 

10. Give a formula for the probability that in a class of n
students at least two have adjacent or identical birthdays.
Ignore leap years. Calculate this probability using a hand
calculator for n = 10, 15, 20 and 25.

11. Compute the probabilities for two dice to show n points,
$2 \le n \le 12$. Do the same for three dice.

12. It is said that the Earl of Yarborough used to bet 1000 
to 1 against being dealt a hand of 13 cards containing no card 
higher than 9 in the whist or bridge order. Did he have a good 
bet? In bridge the cards are ranked in each suit from 2 (the 
lowest) through 10, followed by the Jack, Queen, King and Ace 
in this order. 

13. May the best team win! Let us suppose that the two teams
that meet in the World Series are closely matched: the better
team wins a given game with probability 0.55. What is the
probability that the better team will win the World Series? Do
this as follows. Treat the games as tosses of a biased coin.
Express the event "the better team wins" in terms of elementary
Bernoulli events, and then compute the probability. We consider
in exercise VIII. xx how long a series of games is necessary in
order to be reasonably certain that the best team will win.

14. Although Robin Hood is an excellent archer, getting a
"bullseye" nine times out of ten, he is facing stiff opposition
in the tournament. To win he finds that he must get at least four
bullseyes with his next five arrows. However, if he gets five
bullseyes, he runs the risk of exposing his identity to the
sheriff. Assume that if he wishes to miss the bullseye he can
do so with probability 1. What is the probability that Robin wins
the tournament?

15. A smuggler is hoping to avoid detection by customs officials
by mixing some illegal drug tablets in a bottle containing some
innocuous vitamin pills. Only 5% of the tablets are illegal in
a jar containing 400 tablets. The customs official tests five
of the tablets. What is the probability that he catches the
smuggler? [Answer: about 22.7%] Is this a reasonable way
to make a living?

16. Every evening a man either visits his mother, who lives
downtown, or visits his girl friend, who lives uptown (but not
both). In order to be completely fair, he goes to the bus stop 
every evening at a random time and takes either the uptown or 
the downtown bus, whichever comes first. As it happens each of 
the two kinds of buses stops at the bus stop every 15 minutes 
with perfect regularity (according to a fixed schedule). Yet 
he visits his mother only around twice each month. Why? 

17. In a small college, the members of a certain Board are 
chosen randomly each month from the entire student body. Two 
seniors who have never served on the Board complain that they 
have been deliberately excluded from the Board because of their 
radical attitudes. Do they have a case? There are 1000 students 
in the college and the Board consists of 50 students chosen eight 
times every year. 

18. The smuggler of exercise 15 passes through customs with no
difficulty even though they test 15 tablets. But upon reaching
home he discovers to his dismay that he accidentally put too
many illegal drug tablets in with the vitamin pills, for he finds
that 48 of the remaining 385 tablets are illegal. Does he have
reason to be suspicious? The question he should ask is the
following: given that he packed exactly 48 illegal pills, what
is the probability that none of the 15 tested were illegal?

19. Using Stirling's formula, compute the probability that a 
coin tossed 200 times comes up heads exactly half of the time. 
Similarly what is the probability that in 600 rolls of a die, 
each face shows up exactly 100 times? 

20. The following is the full description of the game of CRAPS. 
On the first roll of a pair of dice, 7 and 11 win, while 2, 3 
and 12 lose. If none of these occur, the number of dots showing 
is called the "point," and the game continues. On every subsequent 
roll, the point wins, 7 loses and all other rolls cause 
the game to continue. You are the shooter: what is your 
probability of winning? 

21. Compute the probability of each of the following kinds of 
poker hand, assuming that every five-card poker hand is equally 
likely. Note that the kinds of hands listed below are pairwise 
disjoint. For example, in normal terminology a straight hand does 
not include the straight flush as a special case. 





      kind of hand       definition

(a)   "nothing"          none of (b)-(j)
(b)   one pair           two cards of the same rank
(c)   two pair           two cards of one rank and two of another
(d)   three-of-a-kind    three cards of the same rank
(e)   straight           ranks in ascending order (ace may be low
                         card or high card but not both at once)
(f)   full house         three of one rank and two of another
(g)   flush              all cards of the same suit
(h)   straight flush     both (e) and (g)
(i)   four-of-a-kind     four cards of the same rank
(j)   royal flush        (h) with ace high


22. It is an old Chinese custom to play a dice game in which 
six dice are rolled and prizes are awarded according to the 
pattern of the faces shown, ranging from "all faces the same" 
(highest prize) to "all faces different." List the possible 
patterns obtainable and compute the probabilities. Do you notice 
any surprises? 

23. Some environmentalists want to estimate the number of white- 
fish in a small lake. They do this as follows. First 50 whitefish 
are caught, tagged and returned to the lake. Some time later 
another 50 are caught and they find 3 tagged ones. For each n 
compute the probability that this could happen if there are n 
whitefish in the lake. For which n is this probability the 
highest? Is this a reasonable estimate for n ? 

24. A group of astrologers has, in the past few years, cast some 
20,000 horoscopes. Consider only the positions (houses) of the 
sun, the moon, Mercury, Venus, Earth, Mars, Jupiter and Saturn. 
There are twelve houses in the Zodiac. Assuming complete random- 
ness, what is the probability that at least two of the horoscopes 
were the same? 



25. In a chess championship, a certain number N of games are 
specified in advance. The current champion must win N games in 
order to retain the championship, while the challenger must win 
more than N in order to unseat the champion. The challenger is 
somewhat weaker than the champion, being able to win only a dozen 
games out of every 25 games which do not end in a tie. If the 
challenger is allowed to choose the number N , what should the 
challenger choose? [Answer: 12]. Reference: Fox, Math. Teacher 
54 (1961), 411-412. 

26. The three-person duel is a difficult situation to analyze in 
full generality. We consider just a simple special case. Three 
individuals, X, Y and Z, hate each other so much they decide to 
have a duel which only one of them can survive. They stand at the 
corners of an equilateral triangle. The probability of a hit by 
each of the three participants is 0.5, 0.75 and 1, respectively. 
For this reason they decide that they will each shoot at whomever 
they wish, taking turns cyclically starting with X and continuing 
with Y, then Z, then X again and so on. All hits are assumed to 
be fatal. What strategy should each employ, and what are their 
probabilities of survival? 

History of Probability 

Historically, the modern theory of probability can be said 
to have begun as a result of a famous correspondence between the 
mathematicians Blaise Pascal (1623-62) and Pierre de Fermat 
(1601-65). Their correspondence came about as a result of problems 
put to Pascal by the Chevalier de Méré, a noted gambler of the 
time. We give below two of these problems. 

27. Two gamblers are playing a game which is interrupted. How 
should the stake be divided? The winner of the game was to be 
the one who first won 4 deals out of 7. One gambler has so far 
won 1 deal and the other 2 deals. They agree to divide the stake 
according to the probability each had of winning the game, and 
this probability is to be computed by assuming that each player 
has equal probability of winning a given deal. Express the event 
that the first gambler wins in terms of elementary Bernoulli 
events. Then compute the probability. [Answer: 5/16]. 

28. Chevalier de Méré apparently believed that it is just as 
probable to show at least one six in four throws of a single die 
as it is to show at least one double-six in twenty-four throws of 
a pair of dice. However, de Méré computed the probabilities of 
these two events and found that one was slightly above, the other 
slightly below, 0.5. What are the exact probabilities? In an 
exercise of Chapter IV, we will consider the likelihood that 
de Méré could have found the distinction between these two 
probabilities empirically. 

29. In what was apparently Isaac Newton's only excursion into 
probability, he answered a question put to him by Samuel Pepys. 
The problem was to determine which is more likely, showing at 
least one six in 6 throws of a die, at least two sixes in 12 
throws or at least three sixes in 18 throws. Compute these 
probabilities and consider the general question of the probability 
of showing at least n sixes in 6n throws of a die. 

Indistinguishability 

30*. Suppose we have a physical system having three energy 
levels and two states per energy level (for a total of six 
states). If two electrons are in the configuration, what is 
the probability that they occupy the lowest two energy levels 
(one in each level)? Consider the same question for two photons 
and two Maxwell-Boltzmann particles. Since the states do not 
have the same energy, the states are not "equiprobable." Assume 
that the probability of one particle being in a state is 
proportional to $e^{-E}$ where E is the energy of the state. In this 
problem suppose that the three energy levels have respective 
energies 1, 2, and 3. 

In the following problem we admittedly oversimplify a bit 
too much, but it does illustrate some of the ideas and techniques 
of modern Physics. 

31*. Consider a small piece of metal at ordinary temperatures. 
It forms a crystal with the nuclei of its atoms appearing in a 
regular fashion throughout the solid. Most of the electrons may 
be regarded as being bound to some one nucleus. Some of the 
electrons, the ones in the outermost orbitals of an atom, have 
more freedom of movement. Call these the valence electrons. 
The outermost orbitals of a given atom form an almost continuous 
band. Let us suppose that the valence electrons act as bosons 
in this environment with any number being allowed in the set of 
outermost orbitals of one given atom. Let us suppose also that 
the outermost orbitals of one given atom all have the same energy. 
(Neither of these assumptions is actually true.) Let k be the 
number of atoms and let n be the number of valence electrons. 
Compute the distribution of $\theta_1$, the number of valence 
electrons occupying the outermost orbitals of one specific atom. 
Now in an actual macroscopic piece of metal, n and k are very 
large (being on the order of $10^{23}$) and so cannot be measured 
exactly. However, the ratio $\lambda = n/k$ is usually not too 
difficult to find. Since n and k are so large, we may regard them 
as being infinite. The distribution of $\theta_1$ is then 
approximated by letting n and k tend to infinity but in such a way 
that the ratio $\lambda = n/k$ is held fixed. Find this limiting 
distribution. Such a distribution can be measured experimentally 
and used to compute $\lambda$ as well as to test our model. 

Identities 

32. Prove formally that = ^k - 1^ 

33*. Give combinatorial proofs of the above identity as well as 
the identity /"\ = { k + ) 

V - 7 

34*. Give a combinatorial proof of the following identity: 

$$ n 2^{n-1} = \binom{n}{1} + 2\binom{n}{2} + \cdots + (n-1)\binom{n}{n-1} + n\binom{n}{n} $$

Random Integers 

35*. Compute $\lim_{s \to 1} P_s(F_i)$ for i = 2,3,...,9. Now pick out 
100 addresses at random from a phone book and tabulate the 
number having each of the 9 possible first digits (ignore 
addresses other than natural number addresses). Do these 
fit with the predicted probabilities? Try 1000 addresses. 

Chapter III Random Variables 

A random variable is a new way of answering 
questions about nature. For example, suppose we toss a coin. 
How long will it take to get the first head? How can one 
answer such a question? Sometimes the first head will appear 
on the first toss, sometimes on the next and so on. Clearly 
we cannot answer such a question with a single number. The 
originality of the probabilistic point of view is that it 
answers such a question with a series of possible answers, 
each with its own probability. 

The intuitive idea of a random variable is that it is 
the strengthening of the notion of a variable. Recall from 
calculus and algebra that a variable is a symbol together with 
a set over which the symbol ranges. For example in calculus 
one often says "let x be a variable ranging over the real 
numbers" or more succinctly "let x be a real variable." Now 
a random variable (or R.V. for short) is a variable together 
with the probability that it takes each of its possible values. 

In particular an integer random variable is a variable 
n ranging over the integers together with the probability $p_n$ 
that it takes the value n. Implicit in this is that 
$\sum_n p_n = 1$, which means that the random variable always 
takes some value or other. Some of the $p_n$ can be zero, which 
means that these integers do not occur as values of the random 
variable. For example, if $p_n = 0$ whenever $n \le 0$, then the 
random variable is said to be positive, i.e. it takes only 
positive integral values. 

1. Integer Random Variables 

We're now ready for the precise mathematical defini- 
tion. Don't be surprised if at first this notion doesn't ap- 
pear to match what we've just been led to expect. It has 
taken an enormous amount of time and effort to make this 
notion rigorous so it will require some effort and many ex- 
amples to make this concept clear. 

An integer random variable is a function X defined on a 
sample space $\Omega$ that takes only integer values. Namely, for 
every sample point $\omega \in \Omega$, $X(\omega)$ is an integer. 
The (probability) distribution of X is the sequence of numbers 
$p_n$ such that $p_n$ is the probability of the event "X equals n". 
The event "X equals n" is usually written $(X=n)$. As a subset of 
$\Omega$, this event is 
$(X=n) = \{\omega \in \Omega : X(\omega) = n\}$. We shall generally 
avoid writing out this set explicitly each time. One should develop 
an intuitive feeling for the event $(X=n)$. 

Of course we have implicitly assumed that the subsets 
$(X=n)$ really are events of the sample space $\Omega$. This is a 
technical point that will never be of direct concern to us. 
Suffice it to say that a fully rigorous definition of an 
integer random variable is: a function X on a sample space 
$\Omega$ such that its values are all integers and such that the 
subsets $(X=n)$ are all events of $\Omega$. 

The probability distribution of an integer R.V. X always 
satisfies $p_n \ge 0$ for all n and $\sum_n p_n = 1$. The former 
property expresses the fact that the $p_n$ are probabilities, while 
the latter says that X always takes some value. 

The intuitive idea of a random variable relates to the 
precise definition of a random variable in the following way. 
Whenever we have some measurement with probabilities, look 
for a sample space and a function on it. The random variable 
then really comes from observing some phenomenon on this sample 
space. The fact that we only had a probability distribution 
at first arose from the fact that we had forgotten about the 
phenomenon from which the measurement came. 

Of course all this means little until we have seen examples. 

A. The Bernoulli Process: tossing a coin 

Recall that $\Omega$ is the set of all infinite sequences of 
zeros and ones corresponding to the tosses of a biased coin 
with probability p of coming up heads (or one) and q of coming 
up tails (or zero). Let $W_1$ be the waiting time for the 
first head. In other words we ask the question: how long 
do we have to wait to get the first head? The answer is a 
probability distribution $p_n$, where $p_n$ is the probability 
that we must wait until the $n$th toss to get the first head. 
In terms of the terminology of R.V.'s: 

$$ p_n = P(W_1 = n). $$

How can we compute this? Well, the event $(W_1 = n)$ is 
the event: "at the $n$th toss we get a head and the preceding 
n-1 tosses are all tails". In terms of elementary events: 

$$ (W_1 = n) = T_1 \cap T_2 \cap \cdots \cap T_{n-1} \cap H_n. $$

Therefore $p_n = P(W_1 = n) = q^{n-1} p$. 

Just for once let us satisfy ourselves that $\sum_n p_n = 1$: 

$$ \sum_n p_n = \sum_{n=1}^{\infty} q^{n-1} p = p \sum_{n=1}^{\infty} q^{n-1} = p \cdot \frac{1}{1-q} = p \cdot \frac{1}{p} = 1. $$

So it checks. 

Of course it isn't really necessary that we do this. The 
very definition of a probability distribution requires that 
it sum to 1. As we shall see probability theory furnishes a 
new way to perform some very complicated infinite sums simply 
by observing that the terms are related to the probability 
distribution of some integer random variable. 
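
As a rough numerical companion to the computation above, the following sketch sums the first terms of the distribution of the waiting time for the first head and compares one value of the distribution with a simulated frequency. The function names, the choice p = 0.3 and the number of trials are illustrative assumptions, not part of the text.

```python
import random


def geometric_pmf(n: int, p: float) -> float:
    """P(W_1 = n) = q^(n-1) * p for n >= 1, where q = 1 - p."""
    return (1.0 - p) ** (n - 1) * p


def partial_sum(p: float, terms: int) -> float:
    """Sum of P(W_1 = n) for n = 1..terms; this equals 1 - q^terms."""
    return sum(geometric_pmf(n, p) for n in range(1, terms + 1))


def simulate_waiting_time(p: float, rng: random.Random) -> int:
    """Toss a p-biased coin until the first head; return the toss number."""
    n = 1
    while rng.random() >= p:  # a tail occurs with probability q = 1 - p
        n += 1
    return n


if __name__ == "__main__":
    p = 0.3  # assumed bias, chosen only for illustration
    print("partial sum of 200 terms:", partial_sum(p, 200))
    rng = random.Random(0)
    trials = 100_000
    hits = sum(simulate_waiting_time(p, rng) == 2 for _ in range(trials))
    print("simulated P(W_1 = 2):", hits / trials)
    print("formula   P(W_1 = 2):", geometric_pmf(2, p))
```

The partial sums approach 1 quickly because the neglected tail is exactly q raised to the number of terms kept.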

Notice that in understanding $W_1$ as a random variable 
we worked completely probabilistically. We never spoke of 
$W_1$ as a function. What is $W_1$ as a function? For each 
$\omega \in \Omega$, $W_1(\omega)$ is the first position of the 
sequence $\omega$ such that at that position $\omega$ has a 1. For 
example, $W_1(00011011\ldots)$ is 4. However looking at $W_1$ as a 
function is quite unnatural. One should try to think of $W_1$ 
purely probabilistically. Indeed, one might say that probability 
theory gives one a whole new way of looking at sets and functions. 

Consider another example. Let $W_k$ be the waiting time 
for the $k$th head. The event $(W_k = n)$ is the event: "a head 
occurs at the $n$th toss and exactly k-1 heads occur during the 
preceding n-1 tosses." The probability distribution is: 

$$ p_n = P(W_k = n) = \binom{n-1}{k-1} p^{k-1} q^{n-k} \, p = \binom{n-1}{k-1} p^k q^{n-k}. $$

How does one see this? Well, the k-1 heads can occur in 
any (k-1)-subset of the first n-1 tosses. There are 
$\binom{n-1}{k-1}$ such subsets. For each such subset, the 
probability of getting heads in those positions and tails in the 
others is $p^{k-1} q^{n-k}$. Finally the probability of getting a 
head on the $n$th toss is p. 

Needless to say it is not very easy to write an explicit 
expression for the events $(W_k = n)$ in terms of elementary events 
although that is implicit in our computation above. 

Notice too that 
$\sum_{n=k}^{\infty} \binom{n-1}{k-1} p^k q^{n-k} = 1$, a fact that 
is not very easy to prove directly. 
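
The claim that the distribution of the waiting time for the k-th head sums to 1 can at least be checked numerically. The sketch below is only such a check, under assumed values of k and p; the helper names are hypothetical.

```python
from math import comb


def kth_head_pmf(n: int, k: int, p: float) -> float:
    """P(W_k = n) = C(n-1, k-1) p^k q^(n-k) for n >= k, where q = 1 - p."""
    q = 1.0 - p
    return comb(n - 1, k - 1) * p**k * q ** (n - k)


def partial_sum(k: int, p: float, upto: int) -> float:
    """Sum of P(W_k = n) for n = k..upto."""
    return sum(kth_head_pmf(n, k, p) for n in range(k, upto + 1))


if __name__ == "__main__":
    # k = 3 and p = 0.4 are arbitrary illustrative choices.
    for upto in (10, 50, 400):
        print(upto, partial_sum(3, 0.4, upto))
```

For k = 1 the formula reduces to the distribution of the waiting time for the first head, which gives a quick consistency check.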

Consider the random variable 

$$ X_n = \begin{cases} 1 & \text{if the } n\text{th trial is } 1 \\ 0 & \text{if the } n\text{th trial is } 0 \end{cases} $$

or more succinctly $X_n$ is the $n$th trial. The distribution of 
$X_n$ is 

$$ p_0 = P(X_n = 0) = q $$

$$ p_1 = P(X_n = 1) = p $$

and all other $p_k$ are zero. 

Next let $S_n$ be the number of heads in the first n 
tosses. The distribution of $S_n$ is: 

$$ p_k = P(S_n = k) = \binom{n}{k} p^k q^{n-k} $$

because the event $(S_n = k)$ means that k heads and n-k tails 
occur in the first n tosses. There are $\binom{n}{k}$ ways that the 
k heads can appear and each pattern has probability $p^k q^{n-k}$ of 
occurring. The fact that $\sum_k p_k = 1$ is just the binomial 
theorem: 

$$ \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} = (p+q)^n = 1^n = 1. $$

Indeed this is a probabilistic proof of the binomial theorem. 

Incidentally the event $(S_n = k)$ is not the same as the event 
$(W_k = n)$. The distinction is that $(W_k = n)$ requires that there 
be k heads in the first n tosses and that the $k$th head occur 
at the $n$th toss. $(S_n = k)$ is only the event that k heads occur 
in the first n tosses. The distinction is reflected in the 
formulas we found for the distributions in each case. 
Another way to represent $S_n$ is: 

$$ S_n = X_1 + X_2 + \cdots + X_n. $$

This illustrates the fact that we may combine random 
variables using algebraic operations. After all, random 
variables are functions on $\Omega$ and as such may be added, 
subtracted, etc. Thus if X and Y are integer R.V.'s on $\Omega$, 
then, as a function, the random variable X+Y takes the value 
$X(\omega) + Y(\omega)$ on the sample point $\omega \in \Omega$. For 
example $(W_k = n)$ is the event 

$$ (X_1 + \cdots + X_{n-1} = k-1) \cap (X_n = 1) = (S_{n-1} = k-1) \cap (X_n = 1). $$

Unfortunately the use of large quantities of symbolism tends 
to obscure the underlying simplicity of the question we 
asked. We shall try to avoid doing this if possible. 
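
To make the "random variables are functions" viewpoint concrete, here is a small sketch that treats a finite prefix of tosses as the sample point, builds the number of heads as a sum of the individual tosses, and checks the resulting distribution against the binomial formula. It also evaluates the event "k-1 heads in the first n-1 tosses and a head at toss n". Exact fractions are used so that equalities can be tested; all names and parameter values are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import comb


def outcome_probability(tosses, p):
    """Probability of one specific finite pattern of tosses (1 = head)."""
    q = 1 - p
    prob = Fraction(1)
    for x in tosses:
        prob *= p if x == 1 else q
    return prob


def distribution_of_sum(n, p):
    """Distribution of S_n = X_1 + ... + X_n by brute-force enumeration."""
    dist = {k: Fraction(0) for k in range(n + 1)}
    for tosses in product((0, 1), repeat=n):
        dist[sum(tosses)] += outcome_probability(tosses, p)
    return dist


def prob_kth_head_at(n, k, p):
    """P((S_{n-1} = k-1) and (X_n = 1)), i.e. the event (W_k = n)."""
    total = Fraction(0)
    for tosses in product((0, 1), repeat=n):
        if sum(tosses[:-1]) == k - 1 and tosses[-1] == 1:
            total += outcome_probability(tosses, p)
    return total


if __name__ == "__main__":
    p = Fraction(1, 3)  # assumed bias, for illustration only
    dist = distribution_of_sum(5, p)
    for k, value in dist.items():
        print(k, value, comb(5, k) * p**k * (1 - p) ** (5 - k))
    print(prob_kth_head_at(6, 2, p), comb(5, 1) * p**2 * (1 - p) ** 4)
```

Enumeration is exponential in n, so this is only meant for small n.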

Now consider the random variable $T_k$, the length of the 
gap between the $(k-1)$st and $k$th heads in the sequence of 
tosses. 

    \omega = 0001   00001   1     001 ...
             T_1    T_2     T_3   T_4        (T_1 = W_1)

The $T_k$'s and $W_k$'s are related to each other: 

$$ T_k = W_k - W_{k-1} $$

$$ W_k = T_1 + T_2 + \cdots + T_k $$



What is the distribution of $T_k$? When we later have the 
notion of conditional probability we will have a very natural 
way to compute this. However we can nevertheless easily 
compute the distribution of $T_k$ because of the independence 
of the various tosses of the coin. In other words when 
computing $P(T_k = n)$ we may imagine that we start just after 
the $(k-1)$st head has been obtained. Therefore the 
distribution of $T_k$ is 

$$ p_n = P(T_k = n) = P(W_1 = n) = q^{n-1} p, $$

exactly the same distribution as that of $W_1$. 

Notice that for k>1, $T_k$ is not the same random variable as 
$W_1$, and yet their distributions are the same. How can this 
be? Actually we have already seen this phenomenon before but 
didn't notice it because it was too trivial an example: 
$X_1, X_2, \ldots$ are all different random variables, but they all 
have the same distributions. This phenomenon will occur 
frequently and is very important. 

Definition. Two integer random variables X and Y are said 
to be equidistributed or stochastically identical when 

$$ P(X = n) = P(Y = n) \quad \text{for all integers } n. $$

Thus for example $W_1$ and $T_k$ are equidistributed R.V.'s. 
Similarly the $X_n$ are equidistributed R.V.'s. Although $X_1$ 
and $X_2$ measure completely different phenomena, they have 
exactly the same probabilistic structure. 

B. The Bernoulli Process: random walk 

Consider the random variables $X'_n$ given by: 

$$ X'_n = \begin{cases} 1 & \text{if the } n\text{th trial is } 1 \\ -1 & \text{if the } n\text{th trial is } 0 \end{cases} $$

$X_n$ and $X'_n$ are closely related: $X'_n = 2X_n - 1$. However if 
we form the random variable analogous to $S_n$ we measure a quite 
different phenomenon. Let $S'_n = X'_1 + \cdots + X'_n$; then $S'_n$ 
is the position of a random walk after n steps: a step to the 
right gives +1, a step to the left gives -1, so the sum of 
the first n steps is the position at that time. 

What is the probability distribution of $S'_n$? This 
calculation is a good example of a "perturbation" (or change 
of variables) applied to a model. We want to compute 
$P(S'_n = x)$. Here we use x for an integer; think of it as a 
point on the x-axis. Let h be the number of heads and t 
the number of tails, both during the first n tosses. Then: 

$$ x = h - t \quad \text{and} \quad n = h + t. $$

Solving for h and t gives: 

$$ h = \tfrac{1}{2}(x+n) \quad \text{and} \quad t = \tfrac{1}{2}(n-x). $$

Therefore: 

$$ P(S'_n = x) = P\bigl(S_n = \tfrac{1}{2}(x+n)\bigr) = \binom{n}{\frac{1}{2}(x+n)} p^{\frac{1}{2}(x+n)} q^{\frac{1}{2}(n-x)}. $$
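
The change of variables from (h, t) to (x, n) can be exercised with a short sketch: it evaluates the distribution of the walk's position through h = (n+x)/2 and compares it with a brute-force enumeration of all step sequences. The convention of returning zero when n+x is odd or |x| > n, and the names used, are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def walk_pmf(n, x, p):
    """P(S'_n = x) = C(n, h) p^h q^t with h = (n+x)/2 and t = (n-x)/2."""
    if abs(x) > n or (n + x) % 2 != 0:
        return Fraction(0)  # position x is unreachable in n steps
    h = (n + x) // 2
    t = n - h
    return comb(n, h) * p**h * (1 - p) ** t


def walk_pmf_bruteforce(n, x, p):
    """Same probability, summing over every sequence of n steps of +1/-1."""
    total = Fraction(0)
    for steps in product((1, -1), repeat=n):
        if sum(steps) == x:
            heads = steps.count(1)
            total += p**heads * (1 - p) ** (n - heads)
    return total


if __name__ == "__main__":
    p = Fraction(2, 5)  # assumed bias, for illustration only
    for x in range(-6, 7):
        print(x, walk_pmf(6, x, p), walk_pmf_bruteforce(6, x, p))
```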



C. Independence and Joint Distributions 

Recall that two events A and B are independent when 
$P(A \cap B) = P(A)P(B)$. This definition is abstracted from 
experience; as for example when tossing a coin, the second time 
the coin is tossed, it doesn't remember what happened the 
first time. We extend this notion to random variables. 
Intuitively two random variables X and Y are independent if the 
measurement of one doesn't influence the measurement of the 
other. In other words the events expressible in terms of X 
are independent of those expressible in terms of Y. We now 
make this precise. 

Definition. Two integer random variables X and Y are 
independent when 

$$ P((X = n_1) \cap (Y = n_2)) = P(X = n_1) P(Y = n_2) $$

for every pair of integers $n_1, n_2$. 



We illustrate this with our standard example: the 
Bernoulli process. $X_k$ and $X_n$ are independent when $k \ne n$. 
This is obvious from the definition of the Bernoulli process. 

Less obvious is that $T_k$ and $T_n$ are independent when 
$k \ne n$. We check this for $T_1$ and $T_2$. By previous 
computations, 

$$ P(T_1 = n_1) = q^{n_1 - 1} p \quad \text{and} \quad P(T_2 = n_2) = q^{n_2 - 1} p. $$

Now compute $P((T_1 = n_1) \cap (T_2 = n_2))$. The event 
$(T_1 = n_1) \cap (T_2 = n_2)$ means that the first $n_1 + n_2$ 
tosses have precisely the pattern: 

    00...01   00...01
      n_1       n_2

Therefore $P((T_1 = n_1) \cap (T_2 = n_2)) = q^{n_1 - 1} p \, q^{n_2 - 1} p$. 
Since this is the same as 
$P(T_1 = n_1) P(T_2 = n_2) = q^{n_1 - 1} p \, q^{n_2 - 1} p$, we 
conclude that $T_1$ and $T_2$ are independent. 

On the other hand, $W_k$ and $W_n$ are not independent R.V.'s. 
This is intuitively obvious, but we will check it nevertheless 
in the case of $W_1$ and $W_2$. We previously computed: 

$$ P(W_1 = n_1) = q^{n_1 - 1} p $$

$$ P(W_2 = n_2) = (n_2 - 1) q^{n_2 - 2} p^2. $$

Now $(W_1 = n_1) \cap (W_2 = n_2)$ is the same as the event 
$(T_1 = n_1) \cap (T_2 = n_2 - n_1)$, both being the event that the 
first $n_2$ tosses have the pattern: 

    00...01   00...01
      n_1     n_2 - n_1

Therefore, 

$$ P((W_1 = n_1) \cap (W_2 = n_2)) = \begin{cases} 0 & \text{if } n_2 \le n_1 \\ q^{n_2 - 2} p^2 & \text{if } n_2 > n_1. \end{cases} $$

Since $P((W_1 = n_1) \cap (W_2 = n_2)) \ne P(W_1 = n_1) P(W_2 = n_2)$ 
(in particular when $n_1 \ge n_2 \ge 2$ one side is zero and the 
other is not), $W_1$ and $W_2$ are not independent. In other words 
$W_1$ influences $W_2$. 
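
The contrast between the independent gaps and the dependent waiting times can be sketched in a few lines. The code below encodes the three formulas just quoted, checks the joint formula against an enumeration of finite toss patterns, and exhibits a pair (n_1, n_2) where the joint probability is zero but the product of the individual probabilities is not. Function names and the value of p are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def pmf_w1(n1, p):
    """P(W_1 = n1) = q^(n1-1) p."""
    return (1 - p) ** (n1 - 1) * p


def pmf_w2(n2, p):
    """P(W_2 = n2) = (n2-1) q^(n2-2) p^2 for n2 >= 2, and 0 otherwise."""
    if n2 < 2:
        return Fraction(0)
    return (n2 - 1) * (1 - p) ** (n2 - 2) * p**2


def joint_w1_w2(n1, n2, p):
    """P((W_1 = n1) and (W_2 = n2)): zero unless n2 > n1 >= 1."""
    if n1 < 1 or n2 <= n1:
        return Fraction(0)
    return (1 - p) ** (n2 - 2) * p**2


def joint_bruteforce(n1, n2, p):
    """Same joint probability by enumerating the first max(n1, n2) tosses."""
    length = max(n1, n2)
    total = Fraction(0)
    for tosses in product((0, 1), repeat=length):
        heads = [i for i, x in enumerate(tosses, start=1) if x == 1]
        if len(heads) >= 2 and heads[0] == n1 and heads[1] == n2:
            k = len(heads)
            total += p**k * (1 - p) ** (length - k)
    return total


if __name__ == "__main__":
    p = Fraction(1, 3)  # assumed bias, for illustration only
    print(joint_w1_w2(2, 5, p), joint_bruteforce(2, 5, p))
    # Dependence: the joint probability vanishes, the product does not.
    print(joint_w1_w2(3, 2, p), pmf_w1(3, p) * pmf_w2(2, p))
    # Independence of the gaps: joint equals P(T_1 = 2) P(T_2 = 3).
    print(joint_w1_w2(2, 5, p) == pmf_w1(2, p) * pmf_w1(3, p))
```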

When two R.V.'s are not independent, is there a way to 
measure the dependence of one of them on the other? In 
more common parlance, how do we measure the "correlation" 
of two R.V.'s? We measure this with the joint distribution 
of two random variables. 

Definition. For two integer random variables X and Y, the 
joint distribution of X and Y is 

$$ c_{n_1, n_2} = P((X = n_1) \cap (Y = n_2)). $$

The numbers $c_{n_1, n_2}$ cannot be computed in general from 
the individual distributions of X and Y. The joint 
distribution measures the total dependence of X and Y or 
equivalently the cause and effect of one R.V. on the other. 

Joint distributions have the following properties: 

(1) $\sum_{n_1} \sum_{n_2} c_{n_1, n_2} = 1$, i.e. something must happen. 

(2) $\sum_{n_2} c_{n_1, n_2} = P(X = n_1)$. 

(3) $\sum_{n_1} c_{n_1, n_2} = P(Y = n_2)$. 


The distributions of X and Y considered relative to their 
joint distribution are called the marginal distributions or 
simply the marginals. Despite the fancy terminology, the 
marginals are simply the distributions of X and Y with which 
we are already familiar. 

Just as we have the joint distribution of two random 
variables, we can have the joint distribution of any finite 
collection of random variables. The formulas are so obvious 
that we won't bother to write them down explicitly. 
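
A joint distribution and its marginals are easy to tabulate for a small case. The sketch below takes X to be the first toss and Y the number of heads in the first n tosses (a dependent pair chosen here purely as an illustration), builds the table of joint probabilities by enumeration, and recovers the marginals by summing over the other index. Names and parameter values are assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def joint_distribution(n, p):
    """Joint distribution of X = X_1 and Y = S_n as a dict {(x, y): prob}."""
    joint = defaultdict(Fraction)
    for tosses in product((0, 1), repeat=n):
        heads = sum(tosses)
        prob = p**heads * (1 - p) ** (n - heads)
        joint[(tosses[0], heads)] += prob
    return dict(joint)


def marginal(joint, axis):
    """Sum the joint table over the other index (axis 0 gives X, 1 gives Y)."""
    result = defaultdict(Fraction)
    for key, prob in joint.items():
        result[key[axis]] += prob
    return dict(result)


if __name__ == "__main__":
    p = Fraction(1, 4)  # assumed bias, for illustration only
    joint = joint_distribution(3, p)
    print("total:", sum(joint.values()))
    print("marginal of X:", marginal(joint, 0))
    print("marginal of Y:", marginal(joint, 1))
```

Here the pair (x, y) = (1, 0) never occurs although both marginal probabilities are positive, so the table is not the product of its marginals.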

We now compute some examples. If X and Y are independent 
random variables with distributions $p_{n_1} = P(X = n_1)$ 
and $r_{n_2} = P(Y = n_2)$, then their joint distribution is 

$$ c_{n_1, n_2} = P((X = n_1) \cap (Y = n_2)) = P(X = n_1) P(Y = n_2) = p_{n_1} r_{n_2}. $$

Therefore the joint distribution of independent R.V.'s is 
the product of the marginals. 

Next consider the random variables $W_j$ and $W_k$ (j<k). 
Their joint distribution is 

$$ c_{n_1, n_2} = P((W_j = n_1) \cap (W_k = n_2)). $$

Of course we must have $n_1 < n_2$. The event 
$(W_j = n_1) \cap (W_k = n_2)$ means that we have j-1 heads in 
the first $n_1 - 1$ tosses and k-j-1 heads in the "gap" of length 
$n_2 - n_1 - 1$ between the $j$th and $k$th heads. 

    010...11      001...1
           |            |
       toss n_1     toss n_2
      (jth head)   (kth head)

Writing all this out gives: 

$$ c_{n_1, n_2} = \binom{n_1 - 1}{j - 1} p^{j-1} q^{n_1 - j} \, p \, \binom{n_2 - n_1 - 1}{k - j - 1} p^{k-j-1} q^{n_2 - n_1 - k + j} \, p $$

$$ = \binom{n_1 - 1}{j - 1} \binom{n_2 - n_1 - 1}{k - j - 1} p^k q^{n_2 - k}. $$

The total number of tosses involved is $n_2$: exactly k of 
them are heads and $n_2 - k$ are tails. This furnishes a quick 
check that the exponents on p and on q are correct. 

As a final example, we compute the joint distribution of 
the first k waiting times. For $n_1 < n_2 < \cdots < n_k$, the 
joint distribution is: 

$$ c_{n_1, n_2, \ldots, n_k} = P((W_1 = n_1) \cap (W_2 = n_2) \cap \cdots \cap (W_k = n_k)). $$

This is actually quite easy to compute because there is only 
one "way" to get the event 
$(W_1 = n_1) \cap \cdots \cap (W_k = n_k)$ up to the $n_k$th 
toss. Therefore: 

$$ c_{n_1, n_2, \ldots, n_k} = P((W_1 = n_1) \cap \cdots \cap (W_k = n_k)) = q^{n_1 - 1} p \, q^{n_2 - n_1 - 1} p \cdots = p^k q^{n_k - k}. $$

D. Fluctuations of Random Walks 

Recall that the basic random variables of the random 
walk sample space are $X'_n$ for n = 1, 2, .... These are 
independent random variables taking values $\pm 1$ with 
probabilities p and q respectively. They represent the direction 
taken during the $n$th step of the random walk. The position of 
the random walk after the $n$th step is then the random variable 
$S'_n = X'_1 + X'_2 + \cdots + X'_n$. We computed the distribution 
of $S'_n$ in general in section 1. For the special case of a 
symmetric random walk, 

$$ P(S'_n = x) = \binom{n}{\frac{1}{2}(n+x)} 2^{-n}. $$

We will write p(n,x) for the above probability. 
Note that $S'_n$ takes only even values for even n and only 
odd values for odd n. 

For the rest of this section we will consider only the 
case of a symmetric random walk, i.e. one for which 
$p = q = \tfrac{1}{2}$. 

First Passage Time and the Reflection Principle 

The event $(S'_n = 0)$ means that the random walk has 
returned to the origin after n steps. However, it could have 
returned many times before. When was the first time it 
returned to the origin (or more generally any point a>0)? 
We answer this by computing the probability distribution 
of the random variable $T_a$, the time when the random walk 
first encounters the point a, i.e. the first time n such 
that $S'_n = a$. 

To compute the distribution of $T_a$ we use an important 
principle called the reflection principle. Consider the 
event $C_{n,a,x}$ = "the random walk is at position x at time n 
and at some previous time was at position a". The 
following is the graph of a typical random walk in $C_{n,a,x}$: 

[Figure: a typical random walk in $C_{n,a,x}$.] 

Now observe that every random walk in $C_{n,a,x}$ is necessarily 
at position a for a first time. We take each random walk in 
$C_{n,a,x}$ and "pivot" or "reflect" it up to the first time that 
it reaches position a: 

[Figure: the same random walk reflected up to its first visit to a.] 

In this way we get a random walk from 2a to x. Conversely, 
any random walk from 2a to x necessarily crosses a 
at some time, so every random walk from 2a to x is uniquely 
determined in this way! Now shift the axis so that 2a becomes 
the origin and x becomes the point x-2a. Then we 
conclude that $P(C_{n,a,x})$ is the same as $P(S'_n = x - 2a)$. By 
symmetry this is the same as $P(S'_n = 2a - x)$. Thus 

$$ P(C_{n,a,x}) = p(n, 2a - x). $$

We are now ready to compute $P(T_a = n)$. We first note 
that $(T_a = n)$ necessarily implies that the random walk moved 
from a-1 to a at step n. Prior to step n the random walk 
never achieved position a but ends at position a-1 at step 
n-1. This is just the "complement" of the event $C_{n-1,a,a-1}$. 
[This random walk is at a-1 at time n-1 so it is in 
$(S'_{n-1} = a-1)$. It never reaches position a so it is not in 
$C_{n-1,a,a-1}$.] More precisely it is the difference of events: 

$$ (S'_{n-1} = a-1) - C_{n-1,a,a-1}. $$

Putting this all together: 

$$ (T_a = n) = \bigl((S'_{n-1} = a-1) - C_{n-1,a,a-1}\bigr) \cap (X'_n = 1). $$

This is the intersection of independent events. Therefore 

$$ P(T_a = n) = P\bigl((S'_{n-1} = a-1) - C_{n-1,a,a-1}\bigr) \, P(X'_n = 1) = \tfrac{1}{2} P\bigl((S'_{n-1} = a-1) - C_{n-1,a,a-1}\bigr). $$

Now $C_{n-1,a,a-1}$ is a subevent of $(S'_{n-1} = a-1)$ so 

$$ P(T_a = n) = \tfrac{1}{2} \bigl[ P(S'_{n-1} = a-1) - P(C_{n-1,a,a-1}) \bigr] $$

$$ = \tfrac{1}{2} \bigl[ p(n-1, a-1) - p(n-1, 2a - (a-1)) \bigr] $$

$$ = \tfrac{1}{2} \bigl[ p(n-1, a-1) - p(n-1, a+1) \bigr]. $$
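
The first passage formula, half of p(n-1,a-1) minus p(n-1,a+1), can be compared with a direct count over all symmetric walks of length n. The sketch below does this for small n and a > 0; the function names and the convention that p(n,x) is zero for unreachable x are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def p_sym(n, x):
    """p(n, x) = P(S'_n = x) for the symmetric walk; 0 if x is unreachable."""
    if n < 0 or abs(x) > n or (n + x) % 2 != 0:
        return Fraction(0)
    return Fraction(comb(n, (n + x) // 2), 2**n)


def first_passage_pmf(a, n):
    """P(T_a = n) = (p(n-1, a-1) - p(n-1, a+1)) / 2 for a > 0."""
    return (p_sym(n - 1, a - 1) - p_sym(n - 1, a + 1)) / 2


def first_passage_bruteforce(a, n):
    """Fraction of n-step walks whose first visit to a occurs at step n."""
    count = 0
    for steps in product((1, -1), repeat=n):
        position = 0
        first_hit = None
        for i, step in enumerate(steps, start=1):
            position += step
            if position == a:
                first_hit = i
                break
        if first_hit == n:
            count += 1
    return Fraction(count, 2**n)


if __name__ == "__main__":
    for n in range(1, 10):
        print(n, first_passage_pmf(2, n), first_passage_bruteforce(2, n))
```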



Maximum Position 

We next ask how far the random walk travels to the 
right (i.e. the maximum position achieved). Let $M_n$ be this 
maximum for an n-step random walk. Conveniently, the events 
$C_{n,a,x}$ are just what we need to compute the distribution of 
$M_n$. Namely, we use the same "trick" of subtracting one of 
the $C_{n,a,x}$, but this time from another event of this kind. 
First note that the event $C_{n,a+1,x}$ is a subevent of 
$C_{n,a,x}$; for if a random walk achieves position a+1, then it 
must have some time previously been at position a. Thus 

$$ P(C_{n,a,x} - C_{n,a+1,x}) = P(C_{n,a,x}) - P(C_{n,a+1,x}) = p(n, 2a - x) - p(n, 2a + 2 - x). $$

But the event $C_{n,a,x} - C_{n,a+1,x}$ means that 
the random walk achieved position a but never achieved 
position a+1, i.e. the maximum achieved by the random 
walk is precisely a. 

[Figure: typical random walk in $C_{n,a,x} - C_{n,a+1,x}$.] 

The only distinction between this 
event and the event $(M_n = a)$ is that the latter does not 
specify the ending point x specified by the former. So to 
get $P(M_n = a)$ we add up the $P(C_{n,a,x} - C_{n,a+1,x})$ for all 
possible values of x. The permissible values of x range from a 
down to any reachable negative point on the x-axis. 
Thus $P(M_n = a)$ is this sum: 

      p(n,2a-a)     - p(n,2a+2-a)        (x=a) 
    + p(n,2a-(a-1)) - p(n,2a+2-(a-1))    (x=a-1) 
    + ...                                (x<a-1) 

which equals 

      p(n,a)   - p(n,a+2) 
    + p(n,a+1) - p(n,a+3) 
    + p(n,a+2) - p(n,a+4) 
    + p(n,a+3) - p(n,a+5) 
    + ... 

Cancelling in the obvious way, we get: 

$$ P(M_n = a) = p(n,a) + p(n,a+1). $$

Note that for each a only one of the summands on the right 
is nonzero. 
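
The telescoped result can be checked against a brute-force count of the maximum over all symmetric walks of length n. In the sketch below the maximum includes the starting position 0, so a ranges over 0, 1, ..., n; that convention and the helper names are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def p_sym(n, x):
    """p(n, x) = P(S'_n = x) for the symmetric walk; 0 if x is unreachable."""
    if n < 0 or abs(x) > n or (n + x) % 2 != 0:
        return Fraction(0)
    return Fraction(comb(n, (n + x) // 2), 2**n)


def max_position_pmf(n, a):
    """P(M_n = a) = p(n, a) + p(n, a+1) for a >= 0."""
    return p_sym(n, a) + p_sym(n, a + 1)


def max_position_bruteforce(n, a):
    """Fraction of n-step walks whose maximum position (from time 0) is a."""
    count = 0
    for steps in product((1, -1), repeat=n):
        position = 0
        best = 0  # the walk starts at the origin
        for step in steps:
            position += step
            best = max(best, position)
        if best == a:
            count += 1
    return Fraction(count, 2**n)


if __name__ == "__main__":
    n = 6
    for a in range(n + 1):
        print(a, max_position_pmf(n, a), max_position_bruteforce(n, a))
```

Because p(n,a) vanishes when n+a is odd and p(n,a+1) vanishes when n+a is even, exactly one of the two terms contributes for each a, as noted above.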

We summarize the computations in this table. 

$T_a$ = time of first passage to or through position a. 
$P(T_a = n) = \tfrac{1}{2} \bigl( p(n-1, a-1) - p(n-1, a+1) \bigr)$. 

$M_n$ = maximum position achieved up to time n. 
$P(M_n = a) = p(n,a) + p(n,a+1)$. 

E. Expectations 

Definition. Suppose X is an integer R.V. with distribution 
$p_n = P(X = n)$. The expectation or mean or average value of X is 

$$ E(X) = \sum_n n \cdot p_n = \sum_n n \cdot P(X = n). $$

It can happen that this sum does not exist. We won't worry 
about this. Implicit in any statement about expectations 
is the assumption that the expectations exist. 

For example if $X_n$ is the $n$th trial in the Bernoulli 
process, $E(X_n) = 1 \cdot p + 0 \cdot q = p$. The expected or 
average value of the $n$th toss is p. Needless to say $X_n$ won't 
ever take the value p (except in the trivial cases p = 0, 1). 
Intuitively, however, if we perform a large number of trials 
and then average the results, we will get roughly p. 

Before we go on to other computations, we need the 
following important result: 

Basic Fact. For any two integer random variables $X$ and $Y$,

$$E(X+Y) = E(X) + E(Y).$$

The surprising thing about this fact is that it holds regardless of whether $X$ and $Y$ are independent or not.

Proof. Let $c_{n_1,n_2} = P((X=n_1)\cap(Y=n_2))$ be the joint distribution of $X$ and $Y$. Now $X+Y$ is a new R.V. What is its distribution? Well, we must consider all possible ways that $X+Y$ can take on a given value:

$$q_k = P(X+Y=k) = \sum_{n_1+n_2=k} P((X=n_1)\cap(Y=n_2)) = \sum_{n_1+n_2=k} c_{n_1,n_2}.$$

Therefore the expectation of $X+Y$ is:

$$\begin{aligned}
E(X+Y) &= \sum_k k\,q_k = \sum_k k \sum_{n_1+n_2=k} c_{n_1,n_2} \\
&= \sum_{n_1}\sum_{n_2} (n_1+n_2)\,c_{n_1,n_2} \\
&= \sum_{n_1}\sum_{n_2} n_1\,c_{n_1,n_2} + \sum_{n_1}\sum_{n_2} n_2\,c_{n_1,n_2} \\
&= \sum_{n_1} n_1\,P(X=n_1) + \sum_{n_2} n_2\,P(Y=n_2) \\
&= E(X) + E(Y).
\end{aligned}$$
This completes the proof.

Notice that it should be intuitively obvious that the expectation of $X+Y$ is $\sum_{n_1}\sum_{n_2}(n_1+n_2)\,c_{n_1,n_2}$, so the Basic Fact is actually easier than the size of our proof suggests. Namely using probabilistic reasoning we proceed as follows. $X+Y$ takes the "value" $n_1+n_2$ with probability $c_{n_1,n_2}$. Adding up all cases gives the expectation:

$$E(X+Y) = \sum_{n_1}\sum_{n_2} (n_1+n_2)\,c_{n_1,n_2}.$$

Now split the sum and take marginals. The result is $E(X) + E(Y)$.
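To see the Basic Fact at work on dependent variables, here is a small sketch. The joint distribution `joint` is a made-up example (not taken from the text) in which $X$ and $Y$ are clearly not independent; the helper names are ours.

```python
from fractions import Fraction

# Hypothetical joint distribution c[(n1, n2)] = P(X = n1 and Y = n2).
joint = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4)}

def expectation_of_sum(c):
    """E(X+Y) computed as the double sum of (n1 + n2) * c[n1, n2]."""
    return sum((n1 + n2) * prob for (n1, n2), prob in c.items())

def marginal_expectations(c):
    """E(X) and E(Y), each computed from the joint distribution."""
    ex = sum(n1 * prob for (n1, _), prob in c.items())
    ey = sum(n2 * prob for (_, n2), prob in c.items())
    return ex, ey

def is_independent(c):
    """True if c[n1, n2] = P(X = n1) * P(Y = n2) for every pair of values."""
    px, py = {}, {}
    for (n1, n2), prob in c.items():
        px[n1] = px.get(n1, 0) + prob
        py[n2] = py.get(n2, 0) + prob
    return all(c.get((a, b), 0) == px[a] * py[b] for a in px for b in py)
```

Here `is_independent(joint)` is false, yet `expectation_of_sum(joint)` equals the sum of the two marginal expectations.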

Let's compute some expectations for the Bernoulli pro- 
cess. First we compute the "hard way" directly from the defini- 
tion, then we compute using the Basic Fact. 

Consider $S_n$, the number of successes in the first $n$ trials. The distribution for $S_n$ is $p_k = \binom{n}{k} p^k q^{n-k}$. So

$$E(S_n) = \sum_k k\,p_k = \sum_{k=0}^{n} k\binom{n}{k} p^k q^{n-k}.$$

Unfortunately we cannot simplify this very easily. On the other hand, $S_n = X_1 + X_2 + \cdots + X_n$. Hence, $E(S_n) = E(X_1) + E(X_2) + \cdots + E(X_n) = np$, since all of these have the same expectation: $p$. In addition we have shown that

$$E(S_n) = \sum_{k=0}^{n} k\binom{n}{k} p^k q^{n-k} = np,$$

a fact that is not so easy to prove.
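The identity can be confirmed exactly for particular values of $n$ and $p$. The sketch below simply evaluates the sum "the hard way" with rational arithmetic; the function name is ours.

```python
from fractions import Fraction
from math import comb

def mean_successes(n, p):
    """Evaluate sum_{k=0}^{n} k * C(n,k) * p^k * q^(n-k) directly."""
    q = 1 - p
    return sum(k * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))
```

Passing `p` as a `Fraction` keeps the arithmetic exact, so the result can be compared with $np$ by equality.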

Next consider the waiting time for the $k^{\text{th}}$ head, $W_k$. The distribution for $W_k$ is $p_n = \binom{n-1}{k-1} p^k q^{n-k}$. Therefore

$$E(W_k) = \sum_n n\,p_n = \sum_{n=k}^{\infty} n\binom{n-1}{k-1} p^k q^{n-k}.$$

Again there is no easy way to compute this infinite sum. However, $W_k = T_1 + T_2 + \cdots + T_k$. Hence, $E(W_k) = E(T_1) + \cdots + E(T_k)$. But all the $T_1,\ldots,T_k$ are equidistributed so in particular they all have the same expectation. Therefore $E(W_k) = k\,E(T_1)$, and we need only compute one expectation: the expectation of $T_1 = W_1$. We shall have to resort to some trickery, but it still isn't too difficult.

$$\begin{aligned}
E(T_1) &= \sum_{n=1}^{\infty} n\,P(T_1 = n) = \sum_{n=1}^{\infty} n\,q^{n-1}p = p\sum_{n=1}^{\infty} \frac{d}{dq}(q^n) = p\,\frac{d}{dq}\Bigl(\sum_{n=1}^{\infty} q^n\Bigr) \\
&= p\,\frac{d}{dq}\Bigl(\frac{q}{1-q}\Bigr) = p\Bigl(\frac{1}{(1-q)^2}\Bigr) = p\cdot\frac{1}{p^2} = \frac{1}{p}.
\end{aligned}$$

Intuitively, it is quite reasonable that $E(T_1) = 1/p$, for if $p$ is large we don't expect to wait very long for a success, while if $p$ is small we expect to wait a long time. As before we get the added bonus of an identity:

$$E(W_k) = \sum_{n=k}^{\infty} n\binom{n-1}{k-1} p^k q^{n-k} = k/p,$$

a fact that is quite hard to believe otherwise.
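For the skeptical reader, the identity can at least be tested numerically. This is a rough sketch that truncates the infinite series after a fixed number of terms; the cutoff `terms=400` is our arbitrary choice and is adequate only when $p$ is not too small.

```python
from math import comb, isclose

def mean_waiting_time(k, p, terms=400):
    """Partial sum of sum_{n>=k} n * C(n-1,k-1) * p^k * q^(n-k)."""
    q = 1.0 - p
    return sum(n * comb(n - 1, k - 1) * p**k * q**(n - k)
               for n in range(k, k + terms))
```

For example, `mean_waiting_time(3, 0.3)` should agree with $k/p = 10$ up to rounding error.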




F. The Inclusion-Exclusion Principle

Imagine that we have a well shuffled deck of cards and that we turn the cards over one at a time. While doing this we call out the names of the cards in their unshuffled order (as in Bridge), beginning with the deuce of clubs and ending with the ace of spades. What is the probability that none of the cards turned over match the name called out when it is turned over? The answer (to an accuracy of $10^{-15}$) is $1/e$. This is strange for two reasons: it depends on the number $e$ which shouldn't appear in a finite counting problem, and it doesn't depend on the number of cards in the deck.

We shall prove this result and several others by an im- 
portant formula called the inclusion-exclusion principle. 
The proof of this principle will follow easily from the forma- 
lism of random variables. 

The abstract setting for the principle is the computation 
of the probability of the union of events in terms of the 
probabilities of the events and of their intersections. For 
example if we have two events A and B, then we know that 



$$P(A\cup B) = P(A) + P(B) - P(A\cap B).$$

[Figure: Venn diagram of two overlapping events $A$ and $B$]

If we refer to the diagram it is clear what this means: $P(A) + P(B)$ "counts" $P(A\cap B)$ twice. Thus we "include" $P(A)$ and $P(B)$ and then "exclude" $P(A\cap B)$.

For three events A , B and C we must include, exclude 
and then include once again: 



$$P(A\cup B\cup C) = P(A) + P(B) + P(C) - P(A\cap B) - P(A\cap C) - P(B\cap C) + P(A\cap B\cap C).$$

It is quite easy to think through the proof of this directly. However for the general case it will take a bit more work. Here is the general formula:

$$P(A_1\cup A_2\cup\cdots\cup A_n) = \sum_i P(A_i) - \sum_{i<j} P(A_i\cap A_j) + \sum_{i<j<k} P(A_i\cap A_j\cap A_k) - \cdots + (-1)^{n+1} P(A_1\cap A_2\cap\cdots\cap A_n)$$

The Inclusion-Exclusion Principle

Note that the second sum is really a double sum over both $i$ and $j$ but subject to $i<j$, the third is a triple sum and so on.

To prove this principle we introduce a special kind of integer random variable called an indicator. Let $A$ be an event; the indicator of $A$ is the integer random variable $I_A$ corresponding to the question "Did $A$ happen?" More precisely for any sample point $\omega\in\Omega$,

$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega\in A \\ 0 & \text{if } \omega\notin A \end{cases}$$

One sometimes also sees the notation $\chi_A$ for the indicator.

We have already encountered such a random variable before. In the Bernoulli process, $H_n$ is the event "the $n^{\text{th}}$ toss is heads" and its indicator $I_{H_n}$ is the random variable $X_n$.

The probability distribution of the indicator $I_A$ is

$$p_n = \begin{cases} P(A^c) & \text{for } n = 0 \\ P(A) & \text{for } n = 1 \\ 0 & \text{otherwise} \end{cases}$$

Therefore the expectation of $I_A$ is $E(I_A) = 0\cdot P(A^c) + 1\cdot P(A) = P(A)$. As a result of this we see that all probabilities may be reduced to the computation of expectations; and one could dispense with sample spaces and events altogether and develop probability theory using only random variables and expectations.

We now consider what happens when we add and multiply indicators. The easy operation is multiplication: $I_A I_B = I_{A\cap B}$ should be obvious. Addition, however, is not so easy because the sum of indicators need not be an indicator: $I_A + I_B$ takes value 2 on $A\cap B$. However if we put in a correction term we get an identity: $I_A + I_B = I_{A\cup B} + I_{A\cap B}$. So while multiplication corresponds to intersection, addition does not quite correspond to union.

The last operation we consider is complementation. Here the result is clear: $I_{A^c} = 1 - I_A$. This suggests what we should do in general to compute $I_{A_1\cup A_2\cup\cdots\cup A_n}$ in terms of the $I_{A_i}$'s: convert to an intersection by using the DeMorgan law. Thus:

$$\begin{aligned}
I_{A_1\cup A_2\cup\cdots\cup A_n} &= 1 - I_{A_1^c\cap A_2^c\cap\cdots\cap A_n^c} \\
&= 1 - I_{A_1^c}\,I_{A_2^c}\cdots I_{A_n^c} \\
&= 1 - (1-I_{A_1})(1-I_{A_2})\cdots(1-I_{A_n}).
\end{aligned}$$

We now multiply out the last expression as in high school algebra:

$$\begin{aligned}
I_{A_1\cup A_2\cup\cdots\cup A_n} &= 1 - \Bigl[1 - \sum_i I_{A_i} + \sum_{i<j} I_{A_i} I_{A_j} - \cdots + (-1)^n I_{A_1} I_{A_2}\cdots I_{A_n}\Bigr] \\
&= \sum_i I_{A_i} - \sum_{i<j} I_{A_i} I_{A_j} + \cdots + (-1)^{n+1} I_{A_1} I_{A_2}\cdots I_{A_n} \\
&= \sum_i I_{A_i} - \sum_{i<j} I_{A_i\cap A_j} + \cdots + (-1)^{n+1} I_{A_1\cap A_2\cap\cdots\cap A_n}.
\end{aligned}$$

Finally we take the expectation of this expression using the Basic Fact of expectations. The result is the inclusion-exclusion principle.
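The principle is easy to test on a small finite sample space. The sketch below assumes a uniform sample space of `OMEGA_SIZE` points and three arbitrarily chosen events; both the events and the names are ours, for illustration only.

```python
from fractions import Fraction
from itertools import combinations

OMEGA_SIZE = 12                       # sample points 0, 1, ..., 11, equally likely
events = [
    {0, 2, 4, 6, 8, 10},              # A: even points
    {0, 3, 6, 9},                     # B: multiples of 3
    {0, 1, 2, 3},                     # C: the first four points
]

def union_probability(events, omega_size):
    """Right-hand side of the inclusion-exclusion principle."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)        # include odd-sized groups, exclude even-sized
        for group in combinations(events, r):
            total += sign * Fraction(len(set.intersection(*group)), omega_size)
    return total
```

The value returned can be compared with the probability of the union computed directly, `Fraction(len(set.union(*events)), OMEGA_SIZE)`.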

We now return to our first question. Think of the sit- 
uation as follows. Start with a new unshuffled deck and then 
shuffle it. The result is a random permutation of the un- 
shuffled deck. What is the probability that no card is in 
the same position in both the unshuffled and the shuffled 
decks? 

To be more precise consider the integers $1,\ldots,n$ instead of the 52 cards. The sample space is the set $\Omega$ of all permutations of $1,\ldots,n$. Thus $|\Omega| = n!$. The notation for permutations is

$$\begin{pmatrix} 1 & 2 & \cdots & n \\ i_1 & i_2 & \cdots & i_n \end{pmatrix},$$

where one should think of the top row as the unshuffled integers and the lower row as the shuffled ones. A fixpoint of a permutation is a number $j$ so that $i_j = j$, i.e. the same number $j$ appears twice in one column in our notation. For example let $n = 3$. There are 6 permutations with number of fixpoints as follows:

    permutation (top row / bottom row)    number of fixpoints
    123 / 123                             3
    123 / 213                             1
    123 / 231                             0
    123 / 321                             1
    123 / 312                             0
    123 / 132                             1

Let $F$ be the event "there is at least one fixpoint". We want to compute $P(F^c)$. Counting $F$ directly is not very easy, but we can write $F$ as the union of events that we can count quite easily. Let $A_i$ be the event "$i$ is a fixpoint". Then $F = A_1\cup A_2\cup\cdots\cup A_n$. Since the $A_i$ are not disjoint we must apply the principle of inclusion-exclusion:

$$P(F) = \sum_i P(A_i) - \sum_{i<j} P(A_i\cap A_j) + \cdots$$



Now an element of $A_1$ has 1 as a fixpoint so it is just a permutation of $\{2,\ldots,n\}$. Therefore $|A_1| = (n-1)!$ and similarly $|A_i| = (n-1)!$. Any element of $A_1\cap A_2$ has two fixpoints so it is a permutation of $\{3,\ldots,n\}$. So $|A_1\cap A_2| = (n-2)!$. More generally $|A_1\cap A_2\cap\cdots\cap A_k| = (n-k)!$. If we divide by $n!$ we get the probabilities, e.g. $P(A_1\cap\cdots\cap A_k) = \frac{(n-k)!}{n!} = \frac{1}{(n)_k}$.

Substituting these into our formula for $P(F)$ gives us:

$$\begin{aligned}
P(F) &= \binom{n}{1}\frac{1}{(n)_1} - \binom{n}{2}\frac{1}{(n)_2} + \binom{n}{3}\frac{1}{(n)_3} - \cdots \\
&= \frac{(n)_1}{1!}\,\frac{1}{(n)_1} - \frac{(n)_2}{2!}\,\frac{1}{(n)_2} + \frac{(n)_3}{3!}\,\frac{1}{(n)_3} - \cdots \\
&= \frac{1}{1!} - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n+1}\frac{1}{n!}.
\end{aligned}$$

From calculus you should immediately recognize this expression as the beginning of the expansion for $1-e^x$ when $x = -1$. This expansion converges so extremely rapidly that it is essentially $1-e^{-1}$ when, say, $n$ is larger than 7. We conclude that

$$P(F^c) \approx \frac{1}{e} \quad\text{to high accuracy (when } n>7\text{)}.$$
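For small $n$ the claim can be checked by listing all permutations. The following sketch compares a brute-force count of the permutations without fixpoints with the alternating series above; the function names are ours.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def no_fixpoint_probability(n):
    """P(F^c) by brute force over all n! permutations of 0..n-1."""
    count = sum(all(perm[i] != i for i in range(n))
                for perm in permutations(range(n)))
    return Fraction(count, factorial(n))

def no_fixpoint_formula(n):
    """1 - (1/1! - 1/2! + ... + (-1)^(n+1) / n!)."""
    return 1 - sum(Fraction((-1) ** (k + 1), factorial(k))
                   for k in range(1, n + 1))
```

Already for $n = 8$ the exact probability differs from `exp(-1)` by less than $10^{-5}$, which illustrates how quickly the series settles down.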

We consider another application. Suppose we have an 
infinite collection of balls labelled 1,2,3,..., and suppose 
we have n boxes. If we drop the balls into the boxes sequen- 
tially, how long do we have to wait until every box con- 
tains at least one ball? If this sounds devoid of physical 
interest consider the following mathematically equivalent 
statement. Suppose we have a molecular beam firing molecules 
at a target crystal. Assume that a molecule adheres to the 
crystal if it strikes an unoccupied lattice site and re- 
bounds (and is lost) if it strikes a previously occupied 
site. If we assume that the molecules are fired at random 
at the crystal sites, how long must we wait until all the 
crystal sites are covered? This problem and perturbations of it are very real problems in surface physics.

The answer to our question will of course be a probability distribution. Let $W$ be the waiting time until all the boxes are occupied. We want to compute $p_k = P(W\le k)$. This is the probability that if we place $k$ balls into $n$ boxes, then all the boxes are occupied. Let $A_i$ be the event "the $i^{\text{th}}$ box is empty". Then $(W\le k) = A_1^c\cap A_2^c\cap\cdots\cap A_n^c$. By the inclusion-exclusion principle,

$$\begin{aligned}
P(W\le k) &= P(A_1^c\cap A_2^c\cap\cdots\cap A_n^c) \\
&= 1 - P(A_1\cup A_2\cup\cdots\cup A_n) \\
&= 1 - \sum_i P(A_i) + \sum_{i<j} P(A_i\cap A_j) - \cdots
\end{aligned}$$

Now the sample space $\Omega$ consists of all placements (Maxwell-Boltzmann) of $k$ balls into $n$ boxes. Thus $|\Omega| = n^k$. The event $A_1$ consists of all placements of $k$ balls into the last $n-1$ boxes. Thus $|A_1| = (n-1)^k$. Similarly $|A_1\cap A_2| = (n-2)^k$ and so on. So the probability of $A_i$ is $P(A_i) = \frac{(n-1)^k}{n^k} = (1-\frac{1}{n})^k$, that of $A_i\cap A_j$ is $P(A_i\cap A_j) = \frac{(n-2)^k}{n^k} = (1-\frac{2}{n})^k$, and so on. Therefore:

$$P(W\le k) = 1 - \binom{n}{1}\Bigl(1-\frac{1}{n}\Bigr)^k + \binom{n}{2}\Bigl(1-\frac{2}{n}\Bigr)^k - \cdots$$
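As a sanity check, the formula for $P(W\le k)$ can be compared with a direct count of the placements that leave no box empty. This is a small sketch for modest $n$ and $k$ only, since it enumerates all $n^k$ placements; the function names are ours.

```python
from fractions import Fraction
from itertools import product
from math import comb

def all_occupied_formula(n, k):
    """1 - C(n,1)(1-1/n)^k + C(n,2)(1-2/n)^k - ... via inclusion-exclusion."""
    return sum((-1) ** j * comb(n, j) * Fraction(n - j, n) ** k
               for j in range(n + 1))

def all_occupied_bruteforce(n, k):
    """Fraction of the n^k placements of k balls in which every box is hit."""
    hits = sum(len(set(placement)) == n
               for placement in product(range(n), repeat=k))
    return Fraction(hits, n ** k)
```

For instance, with $n = 3$ boxes and $k = 4$ balls both functions give $4/9$.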



As a final application of these ideas, we consider the problem of writing $\max(x_1,x_2,\ldots,x_n)$, for $n$ real numbers $x_1,\ldots,x_n$, in terms of their minima. We won't go through all the details. The idea is to consider the set of real numbers as being the sample space $\Omega$ and to use the indicators $I_{(-\infty,x)}$. For example, $I_{(-\infty,x)}\,I_{(-\infty,y)} = I_{(-\infty,x)\cap(-\infty,y)} = I_{(-\infty,\min(x,y))}$ and $I_{(-\infty,x)} + I_{(-\infty,y)} = I_{(-\infty,\max(x,y))} + I_{(-\infty,\min(x,y))}$. We leave it as an exercise to show that:

$$\max(x_1,x_2,\ldots,x_n) = \sum_i x_i - \sum_{i<j}\min(x_i,x_j) + \sum_{i<j<k}\min(x_i,x_j,x_k) - \cdots + (-1)^{n+1}\min(x_1,x_2,\ldots,x_n)$$
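Before attempting the exercise it may help to see the identity hold on concrete numbers. The sketch below evaluates the right-hand side literally, summing minima over all subsets of each size; the function name is ours.

```python
from itertools import combinations

def max_via_minima(xs):
    """Alternating sum of minima over all nonempty subsets of xs."""
    total = 0
    for r in range(1, len(xs) + 1):
        sign = (-1) ** (r + 1)        # plus for odd-sized subsets, minus for even
        total += sign * sum(min(group) for group in combinations(xs, r))
    return total
```

Repeated values cause no difficulty: for the list `[2, 2, 7]` the sum is $11 - 6 + 2 = 7$.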
2. General Random Variables

So far we have considered only integer random 
variables. We now allow random variables to take any real 
value. Unfortunately technical difficulties will appear that 
didn't occur with integer random variables. We begin with an 
example so that we can gradually work our way into the dif- 
ficulties. 

Consider the process of dropping a point on the inter- 
val [0,a]. Intuitively the point is just as likely to fall 
on one part of [0,a] as another. For example it should be 
just as probable for the point to fall on the left half of 
the interval as to fall on the right half. More generally, 
the probability that the point falls in any given subinterval 
is proportional to the length of that subinterval. Unfortunately this leads to the inescapable conclusion that the probability of the point taking any one particular value $x$ is zero.

So we see that the intuitive concept of an integer random variable, i.e. of a variable which takes its values with certain probabilities, is inadequate for describing the phenomenon of a general random variable. In fact there is an intriguing philosophical paradox here: how can the point land anywhere at all if the probability of its landing in any one place is zero? We will avoid such seeming paradoxes by decreeing that the probabilistic structure of a random variable is given by the probabilities that it takes values in intervals. More precisely if $X$ is a random variable, the probabilistic structure of $X$ is given by the probability that $X$ is between $c$ and $d$ for any real numbers $c<d$. We write $P(c<X\le d)$ for this probability. For example, if $X$ is the random variable corresponding to a point dropped at random on $[0,a]$, then for any pair of real numbers $c<d$ in $[0,a]$,

$$P(c<X\le d) = \frac{d-c}{a}.$$

As another example, let $X$ be an integer R.V. Then

$$P(c<X\le d) = \sum_{c<n\le d} p_n.$$

There is a neat way to express the probabilistic structure of random variables in general: the (probability) distribution function. We define this to be the function

$$F(x) = P(X\le x).$$

To compute probabilities on "half-open" intervals we use the fact that:

$$P(c<X\le d) = P(X\le d) - P(X\le c) = F(d) - F(c).$$

For intervals in general we use limits and the above formula. Therefore the probabilistic structure of a random variable is completely determined by its distribution function.

Consider once again the random variable $X$ corresponding to dropping a point at random on $[0,a]$. The distribution function of $X$ is

$$F(x) = P(X\le x) = \begin{cases} 0 & \text{if } x<0 \\ x/a & \text{if } 0\le x\le a \\ 1 & \text{if } x>a \end{cases}$$

When a random variable has this distribution function we shall say that it is uniformly distributed on $[0,a]$. Typically, distribution functions will have "kinks".

[Figure: graph of the distribution function of a uniformly distributed random variable]

We see that the probabilistic meaning of dropping a point at random is that we have a random variable $X$ uniformly distributed on $[0,a]$. We might also say that we are "choosing" or "sampling" a point at random from $[0,a]$. The process of sampling a sequence of $n$ points at random from $[0,a]$ is called the Uniform process. More precisely, a Uniform process of sampling $n$ points from $[0,a]$ is a sequence of $n$ independent random variables $X_1,\ldots,X_n$ uniformly distributed on $[0,a]$. It is the continuous analog of the finite sampling process in chapter II. Note that we do not have to distinguish between sampling with or without replacement because the probability that any two of the sampled points coincide is zero.

A typical question one may ask about the uniform process is: what is the length of the gap between zero and the smallest point of the $n$ dropped points? The naive answer is "it depends on which $X_i$ is the smallest". We shall answer the question with a probability distribution function. More precisely write $X_{(1)}$ (pronounced "X order 1") for the smallest point: $X_{(1)} = \min(X_1,\ldots,X_n)$. Then the probabilistic answer to our question is the distribution function of $X_{(1)}$.

To compute this note that the event $(X_{(1)}>t)$ is the same as saying that all the $X_i$ are greater than $t$. Hence:

$$\begin{aligned}
P(X_{(1)}>t) &= P((X_1>t)\cap(X_2>t)\cap\cdots\cap(X_n>t)) \\
&= P(X_1>t)\,P(X_2>t)\cdots P(X_n>t) \\
&= \Bigl(\frac{a-t}{a}\Bigr)^n
\end{aligned}$$

[Of course to justify this computation rigorously we must define independence for arbitrary random variables. We will do this in the next section.] Therefore the distribution function of $X_{(1)}$ is

$$F_{(1)}(x) = P(X_{(1)}\le x) = 1 - P(X_{(1)}>x) = 1 - \Bigl(\frac{a-x}{a}\Bigr)^n.$$

[Figure: graph of the distribution function $F_{(1)}(x)$]

The distribution function is more and more "concentrated" near 0 as $n$ increases: the more points one drops, the more likely that the first gap is small.
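A quick simulation makes the formula for the distribution of the smallest point plausible. This is only a sketch: the number of trials and the seed are arbitrary choices of ours, and the agreement is statistical rather than exact.

```python
import random

def empirical_cdf_of_minimum(n, a, x, trials=100_000, seed=0):
    """Fraction of simulated samples of n uniform points on [0,a] whose minimum is <= x."""
    rng = random.Random(seed)
    hits = sum(min(rng.uniform(0, a) for _ in range(n)) <= x
               for _ in range(trials))
    return hits / trials

def exact_cdf_of_minimum(n, a, x):
    """1 - ((a - x)/a)^n, valid for 0 <= x <= a."""
    return 1 - ((a - x) / a) ** n
```

With $n = 5$, $a = 2$ and $x = 0.3$ the exact value is $1 - 0.85^5 \approx 0.556$, and the simulated frequency should fall within about a percentage point of it.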

We need a way to express more clearly the fact that the distribution is more concentrated near 0 for larger $n$. Indeed as we shall see, the distribution of a R.V. is not a very good way to visualize the behavior of the R.V. A better way is to use the density of the R.V.

The density of a R.V. $X$ is the derivative (if it exists) of the distribution: $f(x) = \frac{d}{dx}F(x)$. By calculus, $\int_{-\infty}^{x} f(u)\,du = F(x)$. For example the density of $X_1$ in the uniform process is

$$f_1(x) = \begin{cases} 1/a & 0\le x\le a \\ 0 & x<0 \text{ or } x>a \end{cases}$$

[Figure: graph of the density of $X_1$]

Using density we see much more clearly why $X_1$ is said to be uniformly distributed on $[0,a]$: its density is constant on $[0,a]$.

On the other hand, the density of $X_{(1)}$ is

$$f_{(1)}(x) = \begin{cases} \dfrac{n(a-x)^{n-1}}{a^n} & 0\le x\le a \\ 0 & x<0 \text{ or } x>a \end{cases}$$

Notice how the density is sharply peaked at $x=0$ just as we intuitively would expect.

The Concept of Random Variable 

We are now ready to give rigorous definitions of 
the intuitive ideas in the last section. 

Definition. A random variable $X$ is a function from a sample space $\Omega$ to the real numbers, with the property that the subsets $(X\le x) = \{\omega\in\Omega : X(\omega)\le x\}$ are events of $\Omega$ for all real numbers $x$. The (probability) distribution function of a random variable $X$ is the function

$$F(x) = P(X\le x).$$

As similarly noted for integer R.V.'s, the technical assumption that the subsets $(X\le x)$ are events will never bother us. We state it for purely grammatical reasons.

Integer Random Variables

Integer R.V.'s are characterized by the fact that their distribution functions are constant except at integers, where they have discontinuous jumps.

[Figure: graph of the distribution function of an integer random variable]

Being a discontinuous function, the distribution function of an integer R.V. is rather unpleasant to deal with. As a result one generally considers instead the probability distribution $p_n = P(X=n)$. It is unfortunate that $F(x)$ and $p_n$ are both referred to as the distribution of an integer R.V.

Continuous Random Variables

A random variable $X$ is a continuous random variable if its distribution function $F(x)$ is continuous and piecewise differentiable. The derivative $f(x) = F'(x)$ is called the density of $X$. It is the continuous analogue of the probability distribution $p_n$ of an integer R.V. This can be made quite precise using infinitesimals: the probability that $X$ takes a given value $x$ is the infinitesimal $f(x)\,dx$. In other words, the probability that $X$ takes a value in a very small interval $[x,x+h]$ is close to $f(x)h$; the smaller the interval, the closer the approximation.

This suggests that the probability for a continuous random variable $X$ to take a given value $x$ is not quite zero but rather infinitesimal, if $f(x)\ne 0$. So although $(X=x)$ is an unlikely event, it is not impossible. We will write $\operatorname{dens}(X=x)$ for the density $f(x)$ of $X$ at $x$. However, one should take caution when using this notation: $\operatorname{dens}(X=x)$ does not act like a probability $P(X=x)$. To give a concrete example, let $X$ be uniformly distributed on $[0,1]$. Then $2X$ is uniformly distributed on $[0,2]$. Hence $\operatorname{dens}(X=x) = 1 \ne \tfrac{1}{2} = \operatorname{dens}(2X=2x)$, even though the events $(X=x)$ and $(2X=2x)$ are obviously the same. In general, before performing any calculations involving densities, one should first convert them to probabilities. For example,

$$\begin{aligned}
\operatorname{dens}(X=x) &= \frac{d}{dx}P(X\le x) = \frac{d}{dx}P(2X\le 2x) \\
\operatorname{dens}(2X=2x) &= \frac{d}{d(2x)}P(2X\le 2x) = \frac{1}{2}\,\frac{d}{dx}P(2X\le 2x)
\end{aligned}$$

therefore, $\operatorname{dens}(X=x) = 2\operatorname{dens}(2X=2x)$.



The density of $X$ acts precisely as a mass density on the real line, a familiar concept in calculus. Thus, for example, to compute $P(a<X\le b)$ we must integrate the density:

$$P(a<X\le b) = \int_a^b f(x)\,dx.$$

In the case of an integer R.V. we get a sum:

$$P(k\le X\le n) = \sum_{i=k}^{n} p_i.$$

The integral is the continuous analogue of a sum.

Independence

The concept of the independence of two arbitrary R.V.'s ought to be obvious, given the definition in the integer case. Namely two R.V.'s $X$ and $Y$ are independent if the events $(X\le x)$ and $(Y\le y)$ are independent for any pair of real numbers $x$ and $y$:

$$P((X\le x)\cap(Y\le y)) = P(X\le x)\,P(Y\le y).$$

Properties of Densities and Distributions

The distribution function $F(x)$ of an arbitrary R.V. satisfies:

(1) $F(x)\le F(y)$ if $x\le y$

(2) $\lim_{x\to-\infty} F(x) = 0$

(3) $\lim_{x\to\infty} F(x) = 1$

(4) $F$ is right continuous, i.e. $\lim_{y\to x,\ y>x} F(y) = F(x)$

All these are obvious consequences of the definition of the distribution function. It is an interesting exercise to show the converse: any function $F(x)$ satisfying (1)-(4) is the distribution function of some R.V. $X$ on some sample space $\Omega$.

When $X$ is a continuous R.V., its density $f(x)$ satisfies properties analogous to those of the distribution $p_n$ of an integer R.V. Namely,

(1) $f(x)\ge 0$

(2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Joint Distribution and Joint Density

Just as we did for integer random variables, we measure the correlation of two arbitrary R.V.'s by using a joint distribution function. The joint distribution function of R.V.'s $X$ and $Y$ is

$$F(x,y) = P((X\le x)\cap(Y\le y)).$$

If $X$ and $Y$ are continuous, then they also have a joint density:

$$\operatorname{dens}(X=x,\,Y=y) = f(x,y) = \frac{\partial}{\partial x}\frac{\partial}{\partial y}F(x,y).$$

In terms of infinitesimals, the probability that $X$ takes the value $x$ and $Y$ takes the value $y$ is $f(x,y)\,dx\,dy$. As with ordinary densities, be careful not to treat $\operatorname{dens}(X=x,\,Y=y)$ as a probability.

Suppose that $F_X$, $F_Y$ and $f_X$, $f_Y$ denote the distribution functions and densities of the continuous R.V.'s $X$ and $Y$ respectively. We can recover these from their joint counterparts:

$$\int_{-\infty}^{\infty} f(x,y)\,dy = f_X(x), \qquad \int_{-\infty}^{\infty} f(x,y)\,dx = f_Y(y).$$

We call these the marginal densities or marginals.

$$F_X(x) = \lim_{y\to\infty} F(x,y), \qquad F_Y(y) = \lim_{x\to\infty} F(x,y).$$

We won't have much use for the last two formulas.

In terms of the joint distribution and joint density, two random variables $X$, $Y$ are independent if and only if

$$F(x,y) = F_X(x)\,F_Y(y)$$

or

$$f(x,y) = f_X(x)\,f_Y(y).$$

Expectation

For an integer R.V. $X$, the expectation of $X$ is the mean or average value of $X$: $E(X) = \sum_n n\,p_n$. For a continuous R.V. $X$, the expectation is the continuous analogue: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$, if it exists. One should immediately recognize this as the center of mass of the mass density given by $f(x)$.

The expectation of a continuous R.V. also satisfies the property we found so useful for integer R.V.'s:

Basic Fact. For any two continuous random variables $X$ and $Y$, $E(X+Y) = E(X) + E(Y)$.

Proof. This is essentially the same proof as in the integer case.

$$\begin{aligned}
E(X+Y) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x+y)\,f(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,f(x,y)\,dx\,dy + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\,f(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x\,f_X(x)\,dx + \int_{-\infty}^{\infty} y\,f_Y(y)\,dy \\
&= E(X) + E(Y).
\end{aligned}$$



3. The Uniform Process

We now make a detailed investigation of this process in order to illustrate the concepts we have just introduced.

Recall that the Uniform process of sampling $n$ points from $[0,a]$ is the same as a sequence of independent random variables $X_1, X_2, \ldots, X_n$ each being uniformly distributed on the interval $[0,a]$. For example, these random variables might be the measurements of the heights of a random sample of $n$ people. If we wish to ignore the order in which the people are measured, we simply write down the heights in increasing order. We call this new sequence the order statistics of the original sample. In effect we "forget" what the order of sampling was and consider only the set of $n$ measurements.

To be more precise we introduce the following notation:

$$\begin{aligned}
X_{(1)} &= \min(X_1,\ldots,X_n) \\
X_{(2)} &= \text{next larger point after } X_{(1)} \\
&\;\;\vdots \\
X_{(n)} &= \max(X_1,\ldots,X_n)
\end{aligned}$$

How are the order statistics distributed? What are their joint distributions? What are the distributions of the gaps between successive order statistics? Unlike the gaps in the Bernoulli process, these are not independent; for if one is big, the others must be small. What are the joint distributions of the gaps? We shall now answer these and other questions.

Let $F_{(k)}(x)$ and $f_{(k)}(x)$ denote the distribution and density of the $k^{\text{th}}$ order statistic $X_{(k)}$. Thus $F_{(k)}(x) = P(X_{(k)}\le x)$. Now $(X_{(k)}\le x)$ is the event "at least $k$ of the $n$ points fall in the interval $[0,x]$". We decompose this event according to the number of points that actually fall in $[0,x]$.

[Figure: the interval $[0,a]$ with $X_{(k)}$ to the left of $x$; at least $k$ points fall in $[0,x]$]

Therefore:

$$F_{(k)}(x) = P(X_{(k)}\le x) = \binom{n}{k}\Bigl(\frac{x}{a}\Bigr)^k\Bigl(\frac{a-x}{a}\Bigr)^{n-k} + \binom{n}{k+1}\Bigl(\frac{x}{a}\Bigr)^{k+1}\Bigl(\frac{a-x}{a}\Bigr)^{n-k-1} + \cdots + \binom{n}{n}\Bigl(\frac{x}{a}\Bigr)^n.$$

For example, the first summand is the probability that exactly $k$ points fall in $[0,x]$, and hence exactly $n-k$ fall in $(x,a]$. Similarly for the other summands. Needless to say this expression is awkward.

Consider now the density $f_{(k)}(x)$. We first compute that

$$P\bigl((x<X_{(k)}\le x+h) \text{ and no other } X_{(i)} \text{ falls in } [x,x+h]\bigr) = n\binom{n-1}{k-1}\Bigl(\frac{x}{a}\Bigr)^{k-1}\,\frac{h}{a}\,\Bigl(\frac{a-x-h}{a}\Bigr)^{n-k}.$$

[Figure: the interval $[0,a]$ with $k-1$ points falling in $[0,x]$, one point in $[x,x+h]$, and $n-k$ points in $[x+h,a]$]

Here one of the $n$ points falls in $[x,x+h]$: probability $n\cdot\frac{h}{a}$. Next, $k-1$ of the remaining $n-1$ points fall in $[0,x]$: probability $\binom{n-1}{k-1}(\frac{x}{a})^{k-1}$. Finally the remaining $n-k$ points fall in $[x+h,a]$: probability $(\frac{a-x-h}{a})^{n-k}$. Unfortunately what we really want is $P(x<X_{(k)}\le x+h)$. This appears to be a much more complicated computation.

However, we never really have to compute this expression for the following reason. If more than one of the points fall in $[x,x+h]$, the resulting probability involves a factor of $(\frac{h}{a})^2$ (or possibly even a higher power of $\frac{h}{a}$). Thus

$$P(x<X_{(k)}\le x+h) = n\binom{n-1}{k-1}\frac{x^{k-1}\,h\,(a-x-h)^{n-k}}{a^n} + \frac{h^2}{a^2}\cdot(\text{complicated expression}).$$

Now divide by $h$ and take the limit as $h\to 0$:

$$f_{(k)}(x) = \lim_{h\to 0}\frac{P(x<X_{(k)}\le x+h)}{h} = \lim_{h\to 0}\Bigl[n\binom{n-1}{k-1}\frac{x^{k-1}(a-x-h)^{n-k}}{a^n} + \frac{h}{a^2}\cdot(\text{crud})\Bigr]$$

$$f_{(k)}(x) = \begin{cases} n\dbinom{n-1}{k-1}\dfrac{x^{k-1}(a-x)^{n-k}}{a^n} & \text{if } 0\le x\le a \\ 0 & \text{otherwise} \end{cases}$$

We never have to compute the complicated expression because no matter what it is, it disappears when we let $h$ go to zero. We shall use this trick repeatedly. In fact it is precisely because we can make this kind of simplification that the density is so much more computable than the distribution.
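One can check numerically that the simple density really is the derivative of the awkward distribution. The sketch below codes both formulas for the $k^{\text{th}}$ order statistic and leaves the comparison to a central difference quotient; the function names and the sample values in the note are ours.

```python
from math import comb

def order_cdf(n, k, a, x):
    """F_(k)(x): probability that at least k of the n points fall in [0,x]."""
    u = x / a
    return sum(comb(n, m) * u**m * (1 - u)**(n - m) for m in range(k, n + 1))

def order_density(n, k, a, x):
    """f_(k)(x) = n C(n-1,k-1) x^(k-1) (a-x)^(n-k) / a^n on [0,a]."""
    return n * comb(n - 1, k - 1) * x**(k - 1) * (a - x)**(n - k) / a**n
```

For instance, with $n = 5$, $k = 2$, $a = 2$ and $x = 0.7$, the quotient `(order_cdf(5, 2, 2.0, 0.7 + h) - order_cdf(5, 2, 2.0, 0.7 - h)) / (2 * h)` with `h = 1e-6` should agree with `order_density(5, 2, 2.0, 0.7)` to several decimal places.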

We now mention an interesting application. The function $f_{(k)}(x)$ is a probability density so it integrates to 1:

$$1 = \int_{-\infty}^{\infty} f_{(k)}(x)\,dx = \int_0^a n\binom{n-1}{k-1}\frac{x^{k-1}(a-x)^{n-k}}{a^n}\,dx.$$

Therefore:

$$\int_0^a x^{k-1}(a-x)^{n-k}\,dx = \frac{a^n}{n\binom{n-1}{k-1}}.$$

Thus just as integer R.V.'s allow us to compute certain infinite series probabilistically, continuous R.V.'s furnish a technique for computing certain definite integrals. We shall see more of these as we go on.
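The integral formula can be verified exactly without any probability. The sketch below expands $(a-x)^{n-k}$ by the binomial theorem and integrates term by term; this route is our own choice for the check and is not the method of the text.

```python
from fractions import Fraction
from math import comb

def order_statistic_integral(n, k, a):
    """Integral over [0,a] of x^(k-1) (a-x)^(n-k) dx, term by term."""
    a = Fraction(a)
    # (a-x)^(n-k) = sum_m C(n-k,m) a^(n-k-m) (-x)^m, and the integral of
    # x^(k-1+m) over [0,a] is a^(k+m) / (k+m).
    return sum(comb(n - k, m) * (-1) ** m * a ** (n - k - m) * a ** (k + m) / (k + m)
               for m in range(n - k + 1))

def predicted_value(n, k, a):
    """a^n / (n C(n-1,k-1)), the value obtained probabilistically."""
    return Fraction(a) ** n / (n * comb(n - 1, k - 1))
```

For $n = 2$, $k = 1$, $a = 1$ both functions give $1/2$, the integral of $1-x$ over $[0,1]$.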

Next we consider the joint distribution of two order statistics. For example, how does the tenth point influence the twentieth? This is an important question in biostatistics, because of the necessity of biologists to rely on small samples. Tables of order statistics allow one to detect deviations from randomness in a relatively small sample.

As with the above computation it is much more convenient to compute the joint density. Let $X_{(j)}$ and $X_{(k)}$ be two of the order statistics, $j<k$. Then the joint distribution is

$$F_{(j,k)}(x,y) = P((X_{(j)}\le x)\cap(X_{(k)}\le y))$$

and the density is $f_{(j,k)}(x,y) = \frac{\partial}{\partial x}\frac{\partial}{\partial y}F_{(j,k)}(x,y)$. Again as with the computation above we need only compute the probability of the event "$X_{(j)}$ falls in $[x,x+h]$, $X_{(k)}$ falls in $[y,y+\varepsilon]$ and no other points fall in these intervals". We think of these two intervals as dividing $[0,a]$ into 5 boxes into which we drop $n$ distinguishable balls with occupation numbers: $j-1$, $1$, $k-j-1$, $1$, $n-k$. There are $\binom{n}{j-1,\,1,\,k-j-1,\,1,\,n-k}$ ways to place the $n$ balls with these occupation numbers.

[Figure: boxes and occupation numbers. The points $0, x, x+h, y, y+\varepsilon, a$ divide the interval into boxes holding $j-1$, $1$, $k-j-1$, $1$ and $n-k$ points.]

Therefore the event in question has probability

$$\binom{n}{j-1,\,1,\,k-j-1,\,1,\,n-k}\Bigl(\frac{x}{a}\Bigr)^{j-1}\,\frac{h}{a}\,\Bigl(\frac{y-x-h}{a}\Bigr)^{k-j-1}\,\frac{\varepsilon}{a}\,\Bigl(\frac{a-y-\varepsilon}{a}\Bigr)^{n-k}.$$

Dividing by $h\varepsilon$ and letting $h\to 0$ and $\varepsilon\to 0$ gives the joint density:

$$f_{(j,k)}(x,y) = \binom{n}{j-1,\,1,\,k-j-1,\,1,\,n-k}\frac{x^{j-1}(y-x)^{k-j-1}(a-y)^{n-k}}{a^n}$$

or

$$f_{(j,k)}(x,y) = \begin{cases} \dfrac{n!}{(j-1)!\,(k-j-1)!\,(n-k)!}\,\dfrac{x^{j-1}(y-x)^{k-j-1}(a-y)^{n-k}}{a^n} & \text{if } 0\le x<y\le a \\ 0 & \text{otherwise} \end{cases}$$

Finally we consider the joint density of all $n$ order statistics. Let $x_1<x_2<\cdots<x_n$ be real numbers in $[0,a]$. Now

$$P\bigl((x_1<X_{(1)}\le x_1+h_1)\cap(x_2<X_{(2)}\le x_2+h_2)\cap\cdots\cap(x_n<X_{(n)}\le x_n+h_n)\bigr) = n!\,\frac{h_1}{a}\,\frac{h_2}{a}\cdots\frac{h_n}{a},$$

because there are $n!$ ways of placing the $n$ points in the intervals $[x_1,x_1+h_1],\ldots,[x_n,x_n+h_n]$. The $h_i$'s are chosen so small that there is no overlap. Therefore the joint density of all $n$ order statistics is:

$$f(x_1,\ldots,x_n) = \begin{cases} \dfrac{n!}{a^n} & \text{if } 0\le x_1<x_2<\cdots<x_n\le a \\ 0 & \text{otherwise} \end{cases}$$

Like all densities, $f(x_1,\ldots,x_n)$ integrates to 1. Thus we get the interesting multiple integral:

$$\int\!\cdots\!\int_{0\le x_1<x_2<\cdots<x_n\le a} dx_1\,dx_2\cdots dx_n = \frac{a^n}{n!}.$$

This is reasonable because the conditions $0\le x_1<x_2<\cdots<x_n\le a$ determine a "pyramid" cut off the $n$-cube of side $a$ at one corner.

The gaps of the uniform process are the distances
between successive points in increasing order. The gap
between 0 and $X_{(1)}$ is written $L_1$, the gap between $X_{(j)}$ and
$X_{(j+1)}$ is written $L_{j+1}$, and the gap between $X_{(n)}$ and a is
$L_{n+1}$.

[Figure: the interval [0,a] marked with the points $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ and the gaps $L_1, L_2, \ldots$ between them.]

The order statistics may be written in terms of the gaps:

\[ X_{(k)} = L_1 + L_2 + \cdots + L_k . \]



The gaps are not independent: if one is large the others 
must be small. But the gaps are nevertheless equidistributed.
When we have conditional probabilities, we will be able to 
prove this rigorously. However one can prove this probabi- 
listically. Since one of our main objectives is to learn 
to think probabilistically, this kind of proof is actually 
preferable. 

Imagine that we drop n+1 points on a circle of circumference
a. It is intuitively obvious by symmetry that the
gaps (measured along the circumference) so obtained are all
equidistributed. On the other hand, this experiment is
statistically equivalent to the following experiment. Fix one
point (call it 0) on the circle and then drop n more points
at random. If we cut the circle at 0 and stretch it out
over the interval [0,a], then the gap distributions on [0,a]
are the same as those on the circle (the probability that
another of the n points falls at the same place as 0 is zero).
Therefore the gaps on [0,a] are all equidistributed.
This completes the proof.

[Figure: 7 points at random on a circle; 6 points at random on a circle plus a fixed point.]

Therefore all the gaps are distributed the same as
$L_1 = X_{(1)}$. We already computed the density of $X_{(1)}$, so the
density of any gap is

\[ f(x) = \frac{n}{a^n} (a-x)^{n-1} \quad \text{on } [0,a] . \]

The expectation of $L_i$ is given by:

\[ E(L_i) = \int_0^a x \, \frac{n}{a^n} (a-x)^{n-1} \, dx = \frac{n}{a^n} \int_0^a x (a-x)^{n-1} \, dx , \]

but there is an easier way to compute this. Since

\[ E(L_1) = E(L_2) = \cdots = E(L_{n+1}) , \]

and since $L_1 + L_2 + \cdots + L_{n+1} = a$, we conclude that $E(L_i) = \frac{a}{n+1}$
by the Basic Fact. We can now appreciate the power of the
Basic Fact, for the $L_i$'s are not independent.
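The conclusion that every gap has the same expectation a/(n+1) can be illustrated by simulation. The following is a rough sketch rather than a proof: it samples n uniform points on [0,a] many times and averages each of the n+1 gaps. The names, the seed and the choice n = 5, a = 6 are assumptions made only for this example.

```python
import random

def mean_gaps(n, a, trials, seed=1):
    """Average each of the n+1 gaps over many runs of the Uniform process."""
    rng = random.Random(seed)
    totals = [0.0] * (n + 1)
    for _ in range(trials):
        points = sorted(rng.uniform(0, a) for _ in range(n))
        edges = [0.0] + points + [a]       # add the endpoints 0 and a
        for i in range(n + 1):
            totals[i] += edges[i + 1] - edges[i]
    return [t / trials for t in totals]

n, a = 5, 6.0
means = mean_gaps(n, a, trials=100_000)
# Each entry should be near a/(n+1) = 1.0, and the entries sum to a.
print([round(m, 3) for m in means])
```

The averages agree with one another even though the individual gaps in any one run are far from independent.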






Similarly we can compute $E(X_{(i)})$ quite easily. For
$X_{(i)} = L_1 + \cdots + L_i$ implies that $E(X_{(i)}) = E(L_1) + \cdots + E(L_i) = \frac{ia}{n+1}$.

This is certainly what one would intuitively expect, but a
direct computation would be tedious.
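For comparison, here is a sketch of the direct computation. It starts from the density of the i-th order statistic, $n\binom{n-1}{i-1} x^{i-1}(a-x)^{n-i}/a^n$ on [0,a], and assumes the standard Beta integral $\int_0^a x^{i}(a-x)^{n-i}\,dx = a^{n+1}\, i!\,(n-i)!/(n+1)!$, which is quoted here without derivation.

```latex
\begin{align*}
E(X_{(i)}) &= \int_0^a x \, n\binom{n-1}{i-1}\frac{x^{i-1}(a-x)^{n-i}}{a^n}\,dx \\
           &= \frac{n!}{(i-1)!\,(n-i)!\,a^n}\int_0^a x^{i}(a-x)^{n-i}\,dx \\
           &= \frac{n!}{(i-1)!\,(n-i)!\,a^n}\cdot\frac{a^{n+1}\,i!\,(n-i)!}{(n+1)!} \\
           &= \frac{i\,a}{n+1}.
\end{align*}
```

The result agrees with the value obtained from the gaps, at the cost of an integral that the argument by symmetry avoids entirely.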

Consider now a seeming paradox. Suppose we label a
reference point g on a circle of circumference a, then we
drop n points at random. What is the expected length of the
gap that includes the reference point g? The answer is $\frac{2a}{n+1}$,
not $\frac{a}{n}$ as one might intuitively expect. The paradox lies not
in any contradiction but rather in having a false intuition. Think
of the experiment in reverse order: drop n points at random and then
choose a reference point g. Then g is more likely to fall in a longer
gap simply because it is longer. The seeming paradox comes from the
impression that one is performing
the following quite different experiment: drop n points at
random on a circle and then pick a gap at random (i.e. any
gap is as likely to be chosen as any other). This experiment
does indeed have expectation $\frac{a}{n}$.

[Figure: 4 points at random and a reference point g.]
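The two experiments can be compared side by side in a simulation. In the sketch below the reference point g is placed at position 0 on the circle, so that cutting the circle at g turns the gap through g into the sum of the first and last gaps on [0,a]. The function name, the seed and the choice n = 4, a = 1 are illustrative assumptions.

```python
import random

def gap_experiments(n, a, trials, seed=2):
    """Return (mean gap through g, mean of a gap picked at random)."""
    rng = random.Random(seed)
    through_g = 0.0
    picked = 0.0
    for _ in range(trials):
        pts = sorted(rng.uniform(0, a) for _ in range(n))
        # The n gaps going once around the circle.
        gaps = [pts[i + 1] - pts[i] for i in range(n - 1)]
        wrap = a - pts[-1] + pts[0]        # the gap that passes through g = 0
        gaps.append(wrap)
        through_g += wrap
        picked += rng.choice(gaps)         # every gap equally likely
    return through_g / trials, picked / trials

n, a = 4, 1.0
mean_through_g, mean_picked = gap_experiments(n, a, trials=100_000)
print(f"gap through g: {mean_through_g:.3f}  (2a/(n+1) = {2 * a / (n + 1):.3f})")
print(f"random gap:    {mean_picked:.3f}  (a/n = {a / n:.3f})")
```

The gap through g comes out noticeably longer than a gap chosen with equal probability, as the text explains.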





4. Table of Probability Distributions

Random variables are a central concept in the 
theory of probability. For example we saw that the uniform 
process is simply the study of n independent, uniformly 
distributed random variables. One could regard probability 
theory abstractly as the study of certain functions on 
sample spaces, which satisfy certain laws. However this 
would miss the point, because it is the examples that make 
the theory, and we can only learn probability theory by care- 
fully studying the examples, especially the important ones. 

Random variables are classified by their distributions. 
And when one speaks of a distribution one usually has a 
standard model in mind. Learning probability theory therefore
requires learning not just the distributions but also
the natural phenomena that give rise to them. We will now
make a list of distributions and models. We will add to our 
list in subsequent chapters. 

Bernoulli distribution. X has the Bernoulli distribution
if X is an integer R.V. which takes just two values, 0 and 1.
This distribution depends on one parameter, p = P(X=1). The
standard model for this distribution is a toss of a biased
coin, $X_i$, with bias p in the Bernoulli process.

Binomial distribution. X has the binomial distribution if
X is an integer R.V. and $P(X=k) = \binom{n}{k} p^k q^{n-k}$. The binomial
distribution depends on two parameters n and p. The standard
model is $S_n$, the number of heads in the first n tosses of
the Bernoulli process. Here p measures the "bias" of the coin.

Geometric (Pascal) distribution. X has the geometric
distribution if X is an integer R.V. and $P(X=n) = q^{n-1} p$.
The standard models are the waiting time $W_1$ for the first
head in the Bernoulli process and the gap between the
(k-1)st and kth occurrences of heads in the Bernoulli process.

Negative Binomial distribution. X has this distribution if
X is an integer R.V. and $P(X=n) = \binom{n-1}{k-1} q^{n-k} p^k$. This
distribution has one parameter k in addition to the bias p.
The standard model is the kth
waiting time of the Bernoulli process.

Uniform distribution. X has this distribution if X is a
continuous R.V. with density

\[ f(x) = \begin{cases} 1/a & \text{if } 0 \le x \le a \\ 0 & \text{otherwise} \end{cases} \]

The standard model is any $X_i$, "dropping a point at random",
in the Uniform process. Here the parameter a is the length
of the interval on which one is dropping (or sampling) points.

Distributions of order statistics. These are sometimes called
the Dirichlet distributions. X has one of these distributions
if X is a continuous R.V. and its density is

\[ f(x) = \begin{cases} n \binom{n-1}{k-1} \dfrac{x^{k-1} (a-x)^{n-k}}{a^n} & \text{if } 0 \le x \le a \\ 0 & \text{otherwise} \end{cases} \]

There are three parameters a, n and k. The standard model
is the kth order statistic $X_{(k)}$ of n points dropped at
random on [0,a]. The gaps $L_i$ between the order statistics
are all models for the distribution having k = 1.



Distribution         Type         Parameter(s)   Model(s)

Bernoulli            integer      p              $X_i$ in the Bernoulli process
Binomial             integer      n, p           $S_n$ ($X_1$ when n=1) in the Bernoulli process
Geometric (Pascal)   integer      p              $W_1$ or any $T_k$ in the Bernoulli process
Negative Binomial    integer      k, p           $W_k$ in the Bernoulli process
Uniform              continuous   a              $X_i$ in the Uniform process
Dirichlet            continuous   a, n, k        $X_{(k)}$ ($L_i$ when k=1) in the Uniform process



Distribution         Probability distribution or density                          Expectation

Bernoulli            $p_0 = q = 1-p$, $p_1 = p$                                     p
Binomial             $p_k = \binom{n}{k} p^k q^{n-k}$                               np
Geometric            $p_n = q^{n-1} p$                                              1/p
Negative Binomial    $p_n = \binom{n-1}{k-1} q^{n-k} p^k$                           k/p
Uniform              $f(x) = 1/a$ on [0,a]                                          a/2
Dirichlet            $f(x) = n \binom{n-1}{k-1} x^{k-1} (a-x)^{n-k} / a^n$ on [0,a]   ka/(n+1)

Table of Bernoulli and Uniform Distributions
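The expectation column for the integer distributions can be spot-checked by summing k times the probability over the support. The sketch below does this for one arbitrary choice of parameters (p = 0.3, n = 10, k = 3, all picked here for illustration); the two infinite sums are cut off at 500 terms, where the remaining tail is negligible for this p.

```python
from math import comb

p, n, k = 0.3, 10, 3      # illustrative parameter values
q = 1 - p

def binomial(j):
    return comb(n, j) * p**j * q**(n - j)

def geometric(m):
    return q**(m - 1) * p

def neg_binomial(m):
    return comb(m - 1, k - 1) * q**(m - k) * p**k

def expectation(pmf, support):
    """Sum of value times probability over the given support."""
    return sum(x * pmf(x) for x in support)

mean_binomial = expectation(binomial, range(0, n + 1))
mean_geometric = expectation(geometric, range(1, 500))      # truncated sum
mean_neg_binomial = expectation(neg_binomial, range(k, 500))  # truncated sum

print(mean_binomial, n * p)         # table entry: np
print(mean_geometric, 1 / p)        # table entry: 1/p
print(mean_neg_binomial, k / p)     # table entry: k/p
```

Each computed sum should agree with the table entry up to floating-point rounding.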

5. Exercises for Chapter III: Random Variables

Integer Random Variables

1. The thirteen diamonds are taken from a deck of cards and are 
thoroughly shuffled. One diamond is drawn at random and scored 
as follows: two through ten score as their rank, face cards 
score ten and the ace scores eleven. Let S be the score. 
Describe the sample space and probability measure used in 

this problem. Write out S explicitly as a function on the 
sample space. Write out the probability measure P explicitly 
as a function. Do S and P have the same domain? 

2. In San Francisco, a drunk leaves a bar and every 10 seconds 
staggers either one yard down the street with probability 3/4 
or one yard up the street with probability 1/4. Where is the 
drunk after one minute? after two minutes? What is his most 
likely location in each case? How is the most likely location 
varying in time? 

3. A machine that produces screws is subject to occasional 
surges in its power supply. These occur independently during 
each second of time with 90% probability and the machine 
produces one screw every second. In one version of the machine 
there is a fuse that shuts off the machine permanently when a 
power surge occurs. We wish to know how many screws the machine 
produces after it is turned on. Which random variable 

in the Bernoulli process corresponds to this question? Answer
the question. 

4. Another version of the machine in exercise 3 has a temporary
circuit breaker so that during a power surge the functioning of
the machine is interrupted only for one second. We run the
machine for one minute and wish to know how many screws are
produced. Which random variable in the Bernoulli process
corresponds to this question? Answer the question.

5. In a bridge game the deck is thoroughly shuffled and dealt. 
You are dealt a hand containing four spades. How many spades 
was your partner dealt? 

6. Three office workers take a coffee break. They choose one 
of their number at random to pay for the coffee as follows. All 
three flip a coin simultaneously and the one having a different 
outcome pays for the coffee. If all coins come up the same, 
they flip the coins again. How long does it take to determine 
who pays for the coffee? 

7. If one has a coin with a bias p ≠ 1/2, one can nevertheless
use it to synthesize a fair coin by the following trick. Flip 
the coin twice. If the two tosses come out different, we can 
say that we got heads if the first toss was heads and tails 
otherwise. If the two tosses were the same, we toss the coin 
two more times and proceed as above. Show that this produces 

a fair coin toss. How many tosses of the biased coin are re- 
quired to produce one "fair toss"? 

8* Given any bias p between 0 and 1 and a fair coin, one can
synthesize a biased coin toss with this bias as follows. Write
the binary expansion of q = 1 - p. This is just a sequence of
zeroes and ones after the decimal point (binary point?). Now
start tossing the fair coin. When we get a head write down a
1 and for a tail write a 0. Compare the sequence we obtain
with the binary expansion of q. Continue tossing until the
first time that the two sequences differ. At this point we
stop and record what happened on the last toss. Show that what
we record is equivalent to a biased coin toss with bias p. How
long does it take to complete such a toss? Does it depend on p?

Independence 

9. In exercise 2, is the position of the drunk after one minute 
independent of his position after two minutes? 

10* Prove that if X and Y are independent random variables 
and if f (x) and g(y) are two functions, then f(X) and g(Y) 
are also independent random variables. 

Expectation 

11. What is the distribution of S in exercise 1? What is its 
average value? 

12. Compute the average position of the drunk in 

exercise 2 after one minute and after two minutes. How is the 
drunk's average position changing in time? How do these questions 
differ from the questions asked in exercise 2? 

13. In the game of Chuck-A-Luck, three dice are agitated in a
cage shaped like an hourglass. A player may wager upon any of
the outcomes 1 through 6. If precisely one die exhibits that
value, the player wins at even odds; if two dice show that value,
the player wins at 2 to 1 odds; if all three dice show the player's
choice, the payoff is 3 to 1. If none of the three dice show the
player's choice, the player loses. Compute the expected value
of the player's winnings on a bet of one dollar on "2". Is the
game fair? If not, suggest payoff odds that would make the game
fair.

14. What is the average number of dots shown by a die tossed 
once at random? You wish to maximize the value shown by the die. 
If you are allowed to throw the die a second time, when should 
you do so? What is the expected value shown by a die for which
one is allowed one rethrowing? 

15. James Bond is imprisoned in a cell from which there are three 
obvious ways to escape: an air-conditioning duct, a sewer pipe 

and the door (the lock of which doesn't work) . The air-conditioning 
duct leads agent 007 on a two-hour trip whereupon he falls through 
a trap door onto his head, much to the amusement of his captors. 
The sewer pipe is similar but takes five hours to traverse (it 
takes longer to swim than to crawl even for James Bond). Each
fall produces temporary amnesia and he is returned to the cell
immediately after each fall. Assume that he always immediately
chooses one of the three exits from the cell with probability 1/3. 
On the average how long does it take before he notices that the 
door is unlocked? 

16. As new engines are coming off the assembly line in Detroit, 
they are tested to determine the maximum deliverable horsepower. 
In a lot of 50 engines, 49 deliver a maximum of 200 horsepower, 
while one of them doesn't work at all thereby delivering a maximum 
of 0 horsepower. What is the average maximum horsepower of the



engines in the lot? Is the average a reasonable description 
of the maximum horsepower of the engines in the lot? 

17. A gambler hits upon what seems to be a foolproof system. 
He begins with a one-dollar bet playing the game of Black-or-Red
in Roulette, and each time that he loses he doubles the
amount bet over the previous bet until he wins once at which 
point he quits. In this way he stands to recoup his losses when 
he finally does win. He realizes that there is a small chance 
that he will lose everything he has ($1023) , but he considers 
this probability to be small enough that he can ignore it. The 
probability of winning on a given trial is 18/36, in which case 
he wins an amount equal to what he originally bet, otherwise he 
loses his bet. What is the probability that he eventually wins 
and what is his net gain when he does? What is the probability 
that he loses all and how much does he lose? What is his average 
net gain using this system? Is it really foolproof? Is the risk 
he is taking a reasonable one? 

18. Compute the average length of a Craps game. For the rules 
of the game see exercise II. £0. 

In the remaining problems of this section one will need a
hand calculator. In addition we mention the following very useful
formula known as Euler's approximation to the harmonic series:

\[ 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} \approx \ln(n) + 0.57721 \]

where ln denotes the natural logarithm ($\log_e$) and 0.57721
is known as Euler's constant.
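To get a feeling for how good this approximation is, one can compare it with the exact partial sums. The short sketch below does so for a few values of n chosen here for illustration (262 is the set size that appears in the exercises that follow); the function names are not from the text.

```python
import math

EULER_CONSTANT = 0.57721   # the value quoted in the text

def harmonic(n):
    """Exact partial sum 1 + 1/2 + ... + 1/n."""
    return sum(1 / k for k in range(1, n + 1))

def euler_approximation(n):
    return math.log(n) + EULER_CONSTANT

for n in (10, 100, 262, 1000):
    exact = harmonic(n)
    approx = euler_approximation(n)
    print(f"n={n:5d}  sum={exact:.5f}  approx={approx:.5f}  diff={exact - approx:.5f}")
```

The difference shrinks roughly like 1/(2n), so for n in the hundreds the approximation is already good to about two decimal places.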

19. A young baseball fan wants to collect a complete set of
262 baseball cards. The baseball cards are available in a 
completely random fashion, one per package of chewing gum, which 
she buys twice a day. How long on the average does it take her 
to get the complete set? 

20. A super power has 262 missiles stored in well separated 
silos. An enemy is considering a sneak attack. However, for the 
attack to succeed every one of the missiles must be destroyed 
(the missiles are MIRVed: each has 5 independent warheads) . We 
will consider this problem later, but for now we consider the 
following simple model. Assume each attacking warhead hits one 
of the enemy missiles with each enemy missile being equally likely
to be the one that is hit. How many warheads on the average will 
be needed to ensure the destruction of every enemy missile? 

21. The analysis in exercise 20 is overoptimistic for several
reasons. There is a significant probability that a given warhead
will hit none of the silos. Furthermore we want not the
average number of warheads required but rather the number of
warheads needed to ensure with very high probability (say 99%) that
all the enemy missiles have been destroyed. Compute the number
of warheads needed if each attacking warhead has probability 0.75
of hitting its target. Even this is optimistic inasmuch as the
shock waves from nuclear explosions are such that one cannot
expect the various warheads converging on one target or on nearby
targets to be independent. However, it gives one an idea of just
how foolhardy so-called pre-emptive warfare can be.

22. A molecular beam is firing metal ions toward the face of a
crystal. If an ion strikes an unoccupied site on the crystal, it
promptly occupies that site, otherwise it bounces away and is lost.
Every ion hits the crystal somewhere with each site being equally 
likely. If there are lO 1 ^ crystal sites, how many ions must the 
beam fire at the crystal, on the average, in order to fill every 
site? 

23. Guests arrive at random at a party, and the host seats them
as they arrive successively one at a time around a large circular 
table. Twenty guests arrive, ten single men and ten single women. 
On the average how many of the twenty adjacent pairs around the 
table will consist of a man and a woman? 

24. The host in exercise 23 invites twenty couples to a cocktail 
party. As the couples do not know each other, the host decides 
to mix his guests by assigning each man to one of the women in 
such a way that every possible arrangement is equally likely. 
How many couples on the average find themselves assigned to each 
other? See exercise 58. 

25. Return to exercise II. ^ . How many people on the average 
call out their birthdays before a match is found, assuming that 
one is eventually found? How does it compare with the observed 
value? 

26. The Polish mathematician Banach kept two match boxes, one in 
each pocket. Each box initially contained n matches. Whenever 
he wanted a match he reached into one of his pockets completely 
at random. When he found that the box he chose was empty, how 

many matches were in the other box? How many were there on the
average?

27. Compute the average energy of the configuration in exercise
II.30.

28. James Bernoulli proposed the following dice game. The 
player pays one dollar and throws a single die. He then throws
a set of n dice, where n is the number shown by the first 
die. The total number of dots shown by the n dice is then used 
to determine the payoff. 

If the number is less than 12 he loses the bet, if 
the number equals 12 his dollar is returned, while if the number 
exceeds 12 he receives two dollars. Find the expected number 
of dots shown by the n dice. Is the game favorable to the 
player? 

29. Nicolas Bernoulli proposed the following coin-tossing game
which has since been called the St. Petersburg paradox. A player
pays an entrance fee of E rubles to the casino. A coin is then
tossed until it comes up heads. If it requires n tosses to get
the first head, the player is paid $2^n$ rubles, for a net gain
of $2^n - E$ rubles. What is the player's expected net gain? [Answer:
infinite net gain no matter how large E is.] The paradox arises
from the fact that one is placing no limit on the resources of
the casino. If the casino possesses a total of $P = 2^N$ rubles,
compute the net expected gain of the player. For the game to be
fair what should E be? [Answer: N + 1 rubles.] For example,
if the casino has resources of 33.55 million rubles, what entrance
fee would be fair?

30. What is the expected duration of the St. Petersburg game for 
the casino mentioned at the end of exercise 29? 



31. What is the probability that in n tosses of a fair coin 
two heads never occur in a row, i.e. no run of 2 or more heads 
ever occurs? 

32. The generalization of Chevalier de Mere's first problem 
(exercise II. 27) is called the problem of points. The problem 
concerns a game between two players 

that was interrupted before its conclusion. Suppose that N 
points are required to win the game, that player A has N-a points 
and that player B has N-b points. In a given trial A wins with 
probability p and B with probability q = 1 - p. How should the
stakes be divided? The problem was first solved by Montmort. 
Can you solve it also? 

33. Generalize exercise 14 to produce a kind of analog, for dice
throwing, of draw poker. The player throws five dice. He then
has the option to choose a subset of the dice for rethrowing.
This subset can be empty but cannot consist of all the dice. The
process is then repeated for the rethrown dice, continuing until
no more dice may be rethrown. The object is to maximize the
total number of dots showing on the dice. Devise a strategy and

calculate the expected outcome for this strategy. The optimal 

4 

strategy will produce an expected outcome of about 24 — . 
Continuous Random Variables 

34. A boy makes a date with his girlfriend. They are to meet
at some time between 6 PM and 7 PM, but since both are absent- 
minded they forget which time they had agreed upon. As a result 



each arrives at a random moment between 6 and 7. Each waits for 
10 minutes and if the other fails to appear, he or she promptly 
leaves in a blue funk. What is the probability that true love 
prevails (at least this one evening) ? 

35. When five points are chosen uniformly at random from the 
interval [1,2], what is the distribution of the natural logarithm 
of the smallest point? 

36. A gangster stands 10 m from an infinitely long straight 
wall. The gangster fires a gun horizontally in a completely 
random direction toward the wall. Compute the distribution of 
the point on the wall where the bullet hits. Do the same for 
the distance from the bullet to the point of the wall closest 
to the gangster. 

37. A median of a random variable X is any number y such
that

\[ P(X \le y) \ge \tfrac{1}{2} \quad \text{and} \quad P(X \ge y) \ge \tfrac{1}{2} . \]

Prove that the median of any random variable exists. Does it 
have to be unique? Compute it for the gangster distribution in 
exercise 36 above. 

38* After grading an examination, a teacher arranges the papers
in order by grade. The sample median is the middle grade if there are
an odd number of papers and is the average of the two middle
grades otherwise. Give a definition of the sample median in the
Uniform process and compute its distribution.

39. How far apart are the largest and the smallest points in
the Uniform process of sampling n points from [0,a]? We call
this the spread. Compare the spread with the second largest
order statistic, $X_{(n-1)}$.

40. Show that any function satisfying the four properties of a 
distribution function is in fact the distribution of some random 
variable on some sample space. 

41* One can also develop a theory of discrete order statistics.
For this the interval [0,a] is replaced by the set of integers
{1, 2, ..., A}, each of which is equally likely to be chosen, and
a given integer may be chosen more than once. The formulas one
gets are quite complicated. It should be clear, however, that
when A is large compared to n, the number of points chosen,
we may approximate the discrete order statistics with the continuous
ones. The principle that the gaps are equidistributed holds both
for the discrete and for the continuous cases. Compute the
distribution of the first order statistic. Note that this is an
integer random variable.

42* During World War II, the Allies estimated the number of
tanks that had been produced by German industry by collecting the 
serial numbers of abandoned tanks. There are actually two questions 
one can ask here. One can ask for the most likely number of tanks 
that have been produced, or one can ask for the most reasonable 
rough estimate of the number. The former question would be most 
appropriate if we placed a very high value on getting the exact 
number, nearby numbers being useless. The latter question is 
clearly more appropriate in the context of this problem. 



To answer these questions we must rephrase them in the
language of probability. Assume n serial numbers have been
collected, the largest of which is $X_{(n)}$. The first question
should read: what is the number A of tanks such that when n
numbers are chosen uniformly from {1, ..., A} the probability
that we get the actually observed values is as large as possible?
The answer is $X_{(n)}$ itself. Prove this. We call this the
maximum likelihood estimator of A. See exercise II.23 for
another example of such an estimator. For the second question
we want an estimate of A such that if one makes many estimates
of the same number A by this method we will on the average be
close to the correct value. We will consider this question later
in exercise 52.

43* A biologist is studying organelles in a cell. The organelles
in question are spheres of equal but unknown radius r within a
given type of cell, and they are distributed randomly throughout
the cell. The biologist estimates r by observing a cross-section
of the cell and measuring the radii of the visible granules.
Suppose that n granules are observed and that the largest
observed radius is $R_{(n)}$. Find the maximum likelihood estimator
of r. The measurements $R_1, \ldots, R_n$ of the n radii will not
be uniformly distributed. However the random variables
$\sqrt{r^2 - R_1^2}, \ldots, \sqrt{r^2 - R_n^2}$ will be uniformly distributed.

44* Compute directly, without first finding densities, the joint
distribution function of the two order statistics $X_{(j)}$ and $X_{(k)}$
in the Uniform process.

45* Suppose that $X_1, X_2, \ldots, X_n$ are independent uniformly
distributed random variables on the intervals $[0,a_1], \ldots, [0,a_n]$
respectively. Compute the densities and the joint densities of
the order statistics $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$.

46* In exercise 45 above, compute the distributions of the gaps.

Expectations of Continuous Random Variables

47. Compute the average value of the natural logarithm of the 
smallest point among five chosen uniformly from [1,2]. Is this 
the same as the natural logarithm of the average value of the 
smallest point? Explain this. See exercise 35. 

48. Compute the average of the median of the set of order 
statistics. See exercise 38. 

49. Compute the average values of the random variables in 
exercise 36 (the gangster distributions) . 

50. An enzyme randomly breaks each of 24 identical (and very 
long) DNA molecules into two pieces. How long is the shortest 
piece produced? What is the average length of the shortest piece? 

51. Shuffle a deck of cards and turn up cards one at a time
until the first spade appears. How many cards including the
spade do we expose on the average? (Answer: $\frac{53}{14} \approx 3.786$) More
generally if we are looking for one of n cards in the deck,
how many cards must we expose on the average until we find one
of them? (Answer: $\frac{53}{n+1}$)
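The stated answers can be confirmed with exact arithmetic. The sketch below is one possible check, not the intended solution method: it uses the fact that the expected waiting time W equals the sum over k of P(W > k), and that W > k exactly when the first k cards are all non-targets. The function name is a hypothetical choice.

```python
from fractions import Fraction
from math import comb

def expected_draws(deck, targets):
    """Expected number of cards turned up until the first target card appears."""
    others = deck - targets
    # P(W > k) = C(others, k) / C(deck, k): the first k cards avoid every target.
    return sum(Fraction(comb(others, k), comb(deck, k)) for k in range(others + 1))

spades = expected_draws(52, 13)
print(spades, float(spades))
```

Calling `expected_draws(52, n)` for other values of n gives the general case for comparison with 53/(n+1).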

52* Return to exercise 42. To answer the second question we
require a random variable with the property that its expectation
is A. The maximum likelihood estimator will not do because
$E(X_{(n)}) \ne A$. What should one use? [Answer: $\frac{n+1}{n} X_{(n)}$]

Consider next the corresponding question for exercise 43. 

The situation is now more complicated because the observed radii 

are not uniformly distributed. Find a random variable R for this 

n 



problem such that E(R) =r . 



Answer : 



R 



yn' 



(n) 



53* A long DNA molecule is broken into N pieces. Find the
average length of the i th longest piece produced, $1 \le i \le N$. Use
probabilistic reasoning as follows. Let n = N-1 so that our
model is the Uniform process of sampling n points from [0,a],
where a is the length of the DNA molecule. The problem is to
compute the expectations of $L_{(1)}, L_{(2)}, \ldots, L_{(n+1)}$, i.e. of
the order statistics of the gaps. Here is how to compute $E(L_{(1)})$.
First find $P(L_{(1)} > t)$. Now $(L_{(1)} > t)$ is the event
$(L_1 > t) \cap (L_2 > t) \cap \cdots \cap (L_{n+1} > t)$. When this event occurs we
can remove a subsegment of length t from each of the gaps.
The resulting process is the Uniform process of sampling n points
from [0, a - (n+1)t]. Geometrically the event $(L_{(1)} > t)$ is an
n-dimensional cube having side of length a - (n+1)t. Thus

\[ P(L_{(1)} > t) = \frac{(a - (n+1)t)^n}{a^n} . \]

Therefore

\[ \mathrm{dens}(L_{(1)} = t) = n(n+1) \frac{(a - (n+1)t)^{n-1}}{a^n} . \]

It is now easy to calculate $E(L_{(1)})$. [Answer: $\frac{a}{(n+1)^2}$] Similarly to compute
$E(L_{(2)})$ we simply note that when we remove a subsegment of length
$L_{(1)}$ from each gap (which we can do since $L_{(1)}$ is the smallest
gap), what remains is the Uniform process of sampling n - 1
points from $[0, a - (n+1)L_{(1)}]$. Moreover $L_{(2)}$ is the sum of
$L_{(1)}$ and the length of the smallest gap in this smaller process.
This gives $E(L_{(2)})$.

Answer:

\[ E(L_{(2)}) = E(L_{(1)}) + E\left(\frac{a - (n+1)L_{(1)}}{n^2}\right) = \frac{a}{(n+1)^2} + \frac{a}{n^2} - \frac{(n+1)a}{n^2 (n+1)^2} = \frac{a}{(n+1)^2} + \frac{a}{n(n+1)} = \frac{a}{n+1}\left(\frac{1}{n+1} + \frac{1}{n}\right) \]

Continuing by induction we get $E(L_{(i)})$ for all i. Answer:

\[ E(L_{(i)}) = \frac{a}{n+1}\left[\frac{1}{n+1} + \frac{1}{n} + \cdots + \frac{1}{n-i+2}\right] \]

Although the above reasoning
is not, strictly speaking, rigorous, we will show how to make it
completely rigorous in Chapter V. See exercise V.
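The formula for the expected sorted gaps can be compared with a simulation of the Uniform process. The sketch below is an illustration under arbitrary choices (n = 4 points on an interval of length a = 1, fixed seed); the function names are not from the text. Here $L_{(1)}$ is taken to be the smallest gap, as in the exercise.

```python
import random

def expected_sorted_gap(i, n, a):
    """The exercise's formula for E(L_(i)), the i-th smallest of the n+1 gaps."""
    # Terms 1/(n+1) + 1/n + ... + 1/(n-i+2), i terms in all.
    return a / (n + 1) * sum(1 / m for m in range(n - i + 2, n + 2))

def simulated_sorted_gaps(n, a, trials, seed=3):
    """Average the sorted gaps over many runs of the Uniform process."""
    rng = random.Random(seed)
    totals = [0.0] * (n + 1)
    for _ in range(trials):
        edges = [0.0] + sorted(rng.uniform(0, a) for _ in range(n)) + [a]
        gaps = sorted(edges[j + 1] - edges[j] for j in range(n + 1))
        for j, g in enumerate(gaps):
            totals[j] += g
    return [t / trials for t in totals]

n, a = 4, 1.0
simulated = simulated_sorted_gaps(n, a, trials=100_000)
predicted = [expected_sorted_gap(i, n, a) for i in range(1, n + 2)]
for i, (s, f) in enumerate(zip(simulated, predicted), start=1):
    print(f"i={i}: simulated {s:.4f}   formula {f:.4f}")
```

As a consistency check, the predicted values add up to a, as they must since the sorted gaps are just the gaps in a different order.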

54* A biologist allows an enzyme to break a DNA molecule into
10 pieces. The original molecule was 10,000 base pairs long.
Upon examining the pieces, the biologist finds that the smallest
is only 10 base pairs long! How probable is it that the smallest
of 10 pieces could be this short or shorter? Use the results
of exercise 53. Does the biologist have a case for believing that
the enzyme attacked the DNA molecule in a non-random fashion?
[Answers: 8.6%; the event is not at all surprising]
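The quoted figure follows from the formula for $P(L_{(1)} > t)$ in exercise 53 with n = 9 cut points (10 pieces), a = 10,000 and t = 10. A small sketch of the arithmetic, with a hypothetical function name:

```python
def prob_smallest_gap_at_most(t, a, n):
    """P(L_(1) <= t) for n uniform points on [0, a], from P(L_(1) > t) = ((a-(n+1)t)/a)^n."""
    if t <= 0:
        return 0.0
    if (n + 1) * t >= a:
        return 1.0              # the n+1 gaps cannot all exceed t
    return 1 - ((a - (n + 1) * t) / a) ** n

p_short = prob_smallest_gap_at_most(t=10, a=10_000, n=9)
print(f"{p_short:.3f}")
```

The value is 1 - 0.99^9, which is a little under 9 percent.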

The Inclusion-Exclusion Principle

55. A gambler is playing a sequence of games. For each trial he
can choose to bet either on heads or on tails of a toss of a fair
coin. If he bets on heads he gains or loses $1 depending on
whether the coin shows heads or tails respectively. Similarly if
he bets on tails he gains or loses $2 if the coin shows tails or
heads respectively. In each trial the gambler chooses one or
the other bet at random, betting on heads with probability p.
Let A be the event that he bets on heads and let B be the
event that the coin shows heads. Write his net gain in one trial
using indicators. [Answer: $I_A I_B - I_A I_{\bar B} + 2 I_{\bar A} I_{\bar B} - 2 I_{\bar A} I_B$]

56. The President of the U.S. holds frequent news conferences.
The journalists who attend these conferences are usually the same
group, more or less. Let us suppose that in the first two years
of his term the President answers 400 questions put to him by the
100 regular journalists. During this time 4 of the journalists
have never been accorded recognition. These four get together and
complain that they are being discriminated against, arguing that
the probability that none of them was ever recognized is only
$8 \times 10^{-8}$. On the other hand, the President's Press Secretary
argues that the probability for four or more of the journalists
to be ignored is really about 11%, which is not significant
evidence for discrimination. Who is right? Formulate the two
models being used and calculate the required probabilities.
Refer to exercise % t1 for one model. For the other use the finite
uniform process of sampling with replacement 400 journalists from
among 100. The latter calculation requires a small computer.
In Chapter VI we will develop techniques for approximating the
answer with much less effort. See exercise VI.31.

57. In the game of Treize, popular in seventeenth century France, 
13 balls labelled from 1 through 13 were placed in an urn and 
drawn out one at a time at random without replacement. The players 
bet on the waiting time until either the n th ball drawn was labelled 
n or else the urn was emptied. Compute the distribution of this 
waiting time. Generalize to N balls. 



58. An obvious modification of the game of Treize would be to
allow players to bet on the number of times that the n th
ball drawn was labelled n. Compute the distribution of this
number. Generalize to N balls. What is the average number of
matches?

59. A sociologist claims that he can determine a person's 
profession by a single glance. A psychologist decides to test 
his claim. She makes a list of 13 professions and chooses 
photographs (all in a standard pose) of 13 individuals one in 
each profession. She then asks the sociologist to match the 
photos with professions. The sociologist identifies only 5 
correctly. What do you think of his claim? Note that this 
exercise is closely related to exercise 58. 

60* Prove the inclusion-exclusion principle for max and min. 



Hint: $\int_c^{\infty} I_{(-\infty,x)}(y)\,dy = x - c$, provided $c < x$.

61. Return to the molecular beam in exercise 22. Assume the 
beam fires lO 1 ^ ions per second. Compute the distribution of 
the waiting time until the crystal is totally covered. Give a 
formula. Don't try to evaluate it. In exercise VI. 32 we will 
show how to compute an accurate approximation for the value of 
this expression. 

Do the same computation as above for the baseball fan 
problem (exercise 19) and for the pre-emptive nuclear attack 
problem (exercise 20).

62* In a physical configuration there are b bosons, each in
one of N states. Compute the distribution of the number of 
filled states (i.e. the number of states having one or more 
bosons) . 

63* The standard card deck used by ESP experimenters is called
the Zener deck. It has twenty-five cards, five each of five
kinds. A typical test consists of the experimenter in one room 
and the subject in another. The experimenter shuffles the deck 
thoroughly and then turns the cards over one at a time at a fixed 
rate. Simultaneously, the subject is trying to perceive the 
sequence of cards. In order to test whether the subject's per- 
ceived sequence could have been simply a random guess, we must 
calculate the distribution of the number of matches occurring in 
a random permutation of the deck relative to some standard ordering 
of the deck. 

64* In an ancient kingdom the new monarch was required to choose 
his queen by the following custom. One hundred prospective 
candidates are chosen from the kingdom and once a day for one 
hundred days one of the candidates chosen at random was presented 
to the monarch. The monarch had the right to accept or reject 
each candidate on the day of her presentation. When a given 
candidate is rejected she immediately gets married, and so the 
monarch cannot change his mind. Assume that the preferences of 
the monarch can be expressed in a linear order (from the best to 
the worst) and that the monarch wants the best candidate, second 
best won't do. What strategy should he employ? What is the 
probability that he succeeds in his quest for perfection?

Suppose that instead of simply ranking the candidates,
the monarch rates each of them on a 0 to 10 scale, i.e. using
some sort of objective criteria, he computes a real number between
0 and 10 for each. Assume that the ratings are uniformly
distributed on [0,10]. Again the monarch wants the best candidate
among the 100. What strategy should he employ now? What is the
probability of success?


Chapter IV Statistics and the Normal Distribution

The normal distribution arises whenever we make a suc- 
cession of imperfect measurements of a quantity that is sup- 
posed to have a definite value. If all the students in a 
class take the same test, we may think of their grades as 
being imperfect measurements of the average capability of the 
class. In general, when we make a number of independent 
measurements, the average is intuitively going to be an ap- 
proximation of the quantity we are trying to find. 

On the other hand, the various measurements will tend 
to be more or less spread out on both sides of the average 
value. We need a measure for how far individual measurements 
are spread out around the quantity being measured. This 
will tell us, for example, how many measurements must be made 
in order to determine the quantity to a certain accuracy. It 
will also make it possible to formulate statistical tests to 
determine whether or not the data in an experiment fit the 
model we have proposed for the experiment.

1. Variance

The variance of a random variable X is a measure of the
spread of X away from its mean.

Definition. Let X be a random variable whose mean is E(X) = m.
The variance of X is $\mathrm{Var}(X) = E((X-m)^2)$, if this expectation
converges. The square root of the variance is called the
standard deviation of X and is written $\sigma(X) = \sqrt{\mathrm{Var}(X)}$. We
sometimes write $\sigma^2(X)$ for Var(X).

If X is an integer random variable having probability distribution
$p_n = P(X=n)$, then

$$\mathrm{Var}(X) = \sum_n (n-m)^2\, p_n.$$

If X is a continuous random variable whose density is
f(x) = dens(X=x), then

$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x-m)^2 f(x)\,dx.$$



In the continuous case we can imagine that f(x) is the 
density at x of a thin rod. This rod has total mass 1 and 
balances at the mean m. The moment of inertia of this rod
about its balance point is precisely the variance. If we
rotate the rod about m, it has the same angular momentum
as it would if all the mass were concentrated at a distance
$\sigma(X)$ from the point of rotation.

It is possible for a random variable not to have a mean. 
It is also possible for a random variable to have a mean but 
not to have a finite variance. We shall see examples in the 
exercises. However in a great many physical processes it is 
reasonable to assume that the random variables involved do have 
a finite variance (and hence also a mean). For example, on an
exam if the possible scores range from 0 to 100, the measurement
of someone's exam score is necessarily going to have a
finite variance.

A useful formula for the variance is the following: 

$$\mathrm{Var}(X) = E(X^2) - E(X)^2.$$

This is an easy formula to verify. The crucial step is that
the expectation is additive, even when the random variables
involved are not independent.

$$\begin{aligned}
\mathrm{Var}(X) &= E((X-m)^2) \\
&= E(X^2 - 2mX + m^2) \\
&= E(X^2) - 2mE(X) + m^2 \\
&= E(X^2) - 2m^2 + m^2 \\
&= E(X^2) - m^2.
\end{aligned}$$
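As a small check of this formula, the sketch below computes the variance of one roll of a fair die in two ways: from the definition and from E(X^2) - E(X)^2. The die and the helper names are our own choices for illustration; exact fractions are used so that the two answers can be compared without rounding.

```python
from fractions import Fraction

def expectation(dist, f=lambda n: n):
    """E(f(X)) for an integer random variable given as {n: P(X=n)}."""
    return sum(f(n) * p for n, p in dist.items())

def variance_by_definition(dist):
    m = expectation(dist)
    return expectation(dist, lambda n: (n - m) ** 2)

def variance_by_formula(dist):
    return expectation(dist, lambda n: n * n) - expectation(dist) ** 2

die = {n: Fraction(1, 6) for n in range(1, 7)}
print(variance_by_definition(die), variance_by_formula(die))  # 35/12 35/12
```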

As we just remarked, the expectation is additive. In 
general it is not multiplicative; that is, E(XY) need not be
E(X)E(Y). The variance is a measure of the extent to which
the expectation is not multiplicative when X = Y, for in
this case we have $\mathrm{Var}(X) = E(X \cdot X) - E(X)E(X)$. The covariance
of X and Y in general is the difference 

Cov(X,Y) = E(XY) - E(X)E(Y). 

We will not be using covariances except in a few optional 
exercises. Covariances are often used as a measure of the 
independence of random variables because of the following 
important fact:

Fact. If X and Y are independent random variables,
then E(XY) = E(X)E(Y), or equivalently Cov(X,Y) = 0.

Proof. We will only consider the case of integer random
variables. The proof for the continuous case requires annoying
technicalities that obscure the basic idea. We leave
these as an exercise.

We compute the distribution of the product XY in terms
of the distributions of X and of Y using the law of alternatives
and the fact that X and Y are independent.

$$\begin{aligned}
P(XY=n) &= \sum_k P(XY=n \mid X=k)\,P(X=k) \\
&= \sum_k P(Y=n/k \mid X=k)\,P(X=k) \\
&= \sum_k P(Y=n/k)\,P(X=k).
\end{aligned}$$

Therefore the expectation of XY is

$$\begin{aligned}
E(XY) &= \sum_n n\,P(XY=n) \\
&= \sum_n n \sum_k P(Y=n/k)\,P(X=k) \\
&= \sum_n \sum_k n\,P(Y=n/k)\,P(X=k).
\end{aligned}$$

Finally change variables to j and k where j = n/k. Then

$$\begin{aligned}
E(XY) &= \sum_j \sum_k jk\,P(Y=j)\,P(X=k) \\
&= \sum_j j\,P(Y=j) \sum_k k\,P(X=k) \\
&= E(X)E(Y).
\end{aligned}$$

This completes the proof.

We add that it is possible for non-independent random 
variables X and Y to satisfy E(XY) = E(X)E(Y). As a result the
covariance is not a true measure of independence. 
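A standard illustration of this remark, which is our own choice and does not come from the text, takes X uniform on {-1, 0, 1} and Y = X^2. Then Y is completely determined by X, yet the covariance vanishes. The sketch below verifies both claims exactly.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X*X, so Y is a function of X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(f):
    """E(f(X, Y)) under the joint distribution above."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_both = joint.get((0, 0), 0)

print(cov)                      # 0
print(p_both == p_x0 * p_y0)    # False
```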

Now whereas the expectation is additive whether the 
random variables are independent or not, the variance need 
not be additive in general. The most important consequence 
of the above fact is that for independent random variables 
X and Y, the variance is additive. 

$$\begin{aligned}
\mathrm{Var}(X+Y) &= E((X+Y)^2) - (E(X+Y))^2 \\
&= E(X^2 + 2XY + Y^2) - (E(X)+E(Y))^2 \\
&= E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2 \\
&= E(X^2) - E(X)^2 + E(Y^2) - E(Y)^2 \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y).
\end{aligned}$$

In terms of the standard deviations, $\sigma(X+Y) = \sqrt{\sigma^2(X) + \sigma^2(Y)}$;
the standard deviations of independent random variables act
like the components of a vector whose length is $\sigma(X+Y)$.

There are two more properties of the variance that are 
important for us. Both are quite obvious: 

$$\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X), \qquad \mathrm{Var}(X+c) = \mathrm{Var}(X).$$

The first expresses the fact that the variance is a quadratic
concept. The second is called shift invariance. It should
be obvious that merely shifting the value of a random variable 
X by a constant only changes the mean and not the spread of 
the measurement about the mean. 
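The additivity of the variance for independent random variables, the quadratic scaling and the shift invariance can all be checked exactly on small examples. The sketch below does so for a fair die and a biased coin; the two distributions, the constants -3 and 10, and the helper names are our own illustrative choices.

```python
from fractions import Fraction
from itertools import product

def var(dist):
    """Variance of a finite distribution given as {value: probability}."""
    m = sum(v * p for v, p in dist.items())
    return sum((v - m) ** 2 * p for v, p in dist.items())

def push(dist, f):
    """Distribution of f(X) when X has distribution dist."""
    out = {}
    for v, p in dist.items():
        out[f(v)] = out.get(f(v), 0) + p
    return out

def sum_of_independent(dx, dy):
    """Distribution of X+Y for independent X and Y."""
    out = {}
    for (x, p), (y, q) in product(dx.items(), dy.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out

die = {n: Fraction(1, 6) for n in range(1, 7)}
coin = {0: Fraction(1, 4), 1: Fraction(3, 4)}   # a biased coin, p = 3/4

assert var(sum_of_independent(die, coin)) == var(die) + var(coin)   # additivity
assert var(push(die, lambda v: -3 * v)) == 9 * var(die)             # Var(cX) = c^2 Var(X)
assert var(push(die, lambda v: v + 10)) == var(die)                 # shift invariance
```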



(0)  $\mathrm{Var}(X) = E(X^2) - E(X)^2$
     $\sigma(X) = \sqrt{\mathrm{Var}(X)}$

(1)  If X and Y are independent random variables having
     finite variance, then:
     $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
     $\sigma(X+Y) = \sqrt{\sigma^2(X) + \sigma^2(Y)}$

(2)  $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$
     $\sigma(cX) = |c|\,\sigma(X)$

(3)  $\mathrm{Var}(X+c) = \mathrm{Var}(X)$
     $\sigma(X+c) = \sigma(X)$

Basic Properties of Variance and Standard Deviation

We now compute the variances of some of the random
variables we have encountered so far in the Bernoulli, Uniform
and Poisson processes.



Bernoulli Process 



Consider a single toss of a biased coin. This is
described by any random variable $X_n$ of the Bernoulli process.
Recall that

$$X_n = \begin{cases} 0 & \text{if the nth toss is tails} \\ 1 & \text{if the nth toss is heads.} \end{cases}$$

Since $1^2 = 1$ and $0^2 = 0$, $X_n^2$ is the same as $X_n$. Therefore
the variance of any $X_n$ is

$$\mathrm{Var}(X_n) = E(X_n^2) - E(X_n)^2 = E(X_n) - E(X_n)^2 = p - p^2 = pq.$$



[Figure: The variance $\mathrm{Var}(X_n)$ of a toss of a coin whose bias is p; the vertical axis is marked at 1/4.]

[Figure: The standard deviation of a toss of a coin whose bias is p.]

Notice that the largest variance corresponds to a fair coin
(p = 1/2). We intuitively think of a fair coin as having the
most "spread out" distribution of all biased coins; while
the more biased a coin is, the more its distribution is
"concentrated" about its mean.

Next consider the number of successes $S_n$ in n tosses of
a biased coin. Since $S_n = X_1 + X_2 + \dots + X_n$ is the sum of n
independent random variables all of whose variances are the same,
$\mathrm{Var}(S_n) = n\,\mathrm{Var}(X_1) = npq$. If we tried to compute $\mathrm{Var}(S_n)$
directly from the definition, we would get

$$\mathrm{Var}(S_n) = E((S_n - np)^2) = \sum_k (k-np)^2 \binom{n}{k} p^k q^{n-k}.$$

That this is npq is far from obvious.
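Although the identity is not obvious by inspection, it is easy to confirm for particular values. The following sketch evaluates the sum exactly for a few choices of n with p = 1/3 and compares it with npq; the function name and the test values are ours.

```python
from fractions import Fraction
from math import comb

def binomial_variance_direct(n, p):
    """Sum of (k - np)^2 C(n,k) p^k q^(n-k) over k = 0..n."""
    q = 1 - p
    return sum((k - n * p) ** 2 * comb(n, k) * p ** k * q ** (n - k)
               for k in range(n + 1))

p = Fraction(1, 3)
for n in (1, 4, 10, 25):
    assert binomial_variance_direct(n, p) == n * p * (1 - p)
```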

We leave the computation of the variances of the gaps
$T_k$ and the waiting times $W_k$ as exercises.

$$\mathrm{Var}(T_k) = \frac{q}{p^2}, \qquad \mathrm{Var}(W_k) = \frac{kq}{p^2}.$$

Uniform Process 

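Consider a point X dropped at random uniformly on [0,a].
Clearly the average value of the point is a/2, the midpoint
of [0,a]. The variance is

$$\begin{aligned}
\mathrm{Var}(X) &= E(X^2) - E(X)^2 \\
&= \int_0^a x^2\,\mathrm{dens}(X=x)\,dx - \left(\frac{a}{2}\right)^2 \\
&= \int_0^a x^2\,\frac{1}{a}\,dx - \left(\frac{a}{2}\right)^2 \\
&= \frac{1}{a}\left[\frac{x^3}{3}\right]_0^a - \frac{a^2}{4} \\
&= \frac{1}{a}\,\frac{a^3}{3} - \frac{a^2}{4} \\
&= \frac{a^2}{12}.
\end{aligned}$$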

So the standard deviation is $\frac{a}{\sqrt{12}} = \frac{a}{2\sqrt3} \approx 0.2887a$. We can
think of this in the following way. Given a uniform bar of
length a, its midpoint is the center of mass. If the bar
were set spinning around its center of mass, the angular
momentum would be the same as if the mass were all at a distance
$\frac{a}{2\sqrt3}$ from the center of rotation.

We leave it as an exercise to compute the variances of 
the gaps and the order statistics of the Uniform process. 
Notice that we cannot use the fact that $X_{(k)} = L_1 + L_2 + \dots + L_k$
because the gaps are not independent.

$$\mathrm{Var}(L_k) = \frac{a^2\,n}{(n+1)^2(n+2)}, \qquad \mathrm{Var}(X_{(k)}) = \frac{a^2\,k\,(n-k+1)}{(n+1)^2(n+2)}.$$
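Before attempting the exercise it can be reassuring to see the order-statistic formula agree with a simulation. The sketch below is only a numerical sanity check under our own choices (n = 5 points, k = 2, a = 1, a fixed random seed); it proves nothing, and the function names are hypothetical.

```python
import random

def simulate_order_statistic_variance(n, k, a=1.0, trials=100_000, seed=0):
    """Estimate Var(X_(k)) for n points dropped uniformly on [0, a]."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        points = sorted(rng.uniform(0, a) for _ in range(n))
        values.append(points[k - 1])          # k-th smallest point
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials

def formula(n, k, a=1.0):
    """The closed form a^2 k (n-k+1) / ((n+1)^2 (n+2))."""
    return a * a * k * (n - k + 1) / ((n + 1) ** 2 * (n + 2))

if __name__ == "__main__":
    print(simulate_order_statistic_variance(5, 2), formula(5, 2))
```

Setting k = 1 gives the corresponding check for a single gap.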



We summarize the above computations in this table.



Distribution        Model(s)                                            Expectation   Variance

Bernoulli           $X_n$ in the Bernoulli process                      p             pq
Binomial            $S_n$ ($X_1$ when n=1) in the Bernoulli process     np            npq
Geometric           $W_1$ or any $T_k$ in the Bernoulli process         1/p           $q/p^2$
Negative Binomial   $W_k$ in the Bernoulli process                      k/p           $kq/p^2$
Uniform             $X_i$ in the Uniform process                        a/2           $a^2/12$
Dirichlet           $X_{(k)}$ ($L_i$ when k=1) in the Uniform process   ka/(n+1)      $a^2 k(n-k+1)/((n+1)^2(n+2))$

Table of Means and Variances

Standardization 

If we shift a random variable X by a constant, replacing 
X by X+c or if we make a scale change, multiplying X by a 
nonzero constant, we have not altered X in a significant way. 
We have only reinterpreted a measurement of X by a linear 
change of variables. The idea of standardization is to 
choose a single "standard" random variable among all those 
related to one another by a linear change of variables. Then 
in order to determine if two random variables are "essentially" 
the same we should compare their standardized versions.

Definition. A random variable X is standard or standardized
if E(X) = 0 and Var(X) = 1. If X has finite variance, then

$$\frac{X-m}{\sigma},$$

where m = E(X) and $\sigma = \sigma(X)$, is standard. We call $(X-m)/\sigma$
the standardization of X. A physicist would say that $(X-m)/\sigma$
expresses X in "dimensionless units."

The covariance of the standardizations of two random
variables X and Y is called the coefficient of correlation
and is written $\rho(X,Y)$. It is easy to prove that $|\rho(X,Y)| \le 1$
and that $\rho(X,Y) = \mathrm{Cov}(X,Y)/(\sigma(X)\,\sigma(Y))$, and we leave these
as exercises. Because of the importance of standardization,
we will prove that the standardization of a random variable
is really standard.

Fact. If X has finite variance, then $(X-m)/\sigma$ is standard.



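Proof.

$$E\!\left(\frac{X-m}{\sigma}\right) = \frac{E(X)-m}{\sigma} = \frac{m-m}{\sigma} = 0$$

$$\begin{aligned}
\mathrm{Var}\!\left(\frac{X-m}{\sigma}\right) &= \frac{1}{\sigma^2}\,\mathrm{Var}(X-m) && \text{(Basic fact 2)} \\
&= \frac{1}{\sigma^2}\,\mathrm{Var}(X) && \text{(Basic fact 3)} \\
&= 1 && \text{(since } \sigma^2 = \mathrm{Var}(X)\text{)}
\end{aligned}$$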



We call $\sigma(X)$ the standard deviation because of its appearance
in the standardization. We think of $\sigma(X)$ as being
the natural unit for measuring how far a given observation
of X deviates from the mean. The importance of standardization
will gradually emerge in the next few sections.
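To make the definition concrete, the sketch below standardizes the distribution of a fair die and confirms numerically that the result has mean 0 and variance 1. The die and the helper names are our own illustration; the same two functions work for any finite distribution with nonzero variance.

```python
from fractions import Fraction
from math import sqrt

def mean_and_variance(dist):
    """Mean and variance of a finite distribution {value: probability}."""
    m = sum(v * p for v, p in dist.items())
    return m, sum((v - m) ** 2 * p for v, p in dist.items())

def standardize(dist):
    """Distribution of (X - m)/sigma for a finite distribution {value: prob}."""
    m, var = mean_and_variance(dist)
    sigma = sqrt(var)
    return {(v - m) / sigma: p for v, p in dist.items()}

die = {n: Fraction(1, 6) for n in range(1, 7)}
m, var = mean_and_variance(standardize(die))
print(round(float(m), 12), round(float(var), 12))
```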



2. Bell-Shaped Curve

In this section we introduce one of the most important 
distributions in probability: the normal distribution. The 
traditional explanation for the importance of the normal 
distribution relies on the Central Limit Theorem, which we 
will discuss in the next section. However we feel that the 
explanation, using entropy and information, given in chapter 
VII is better because it provides a context which explains 
the ubiquity not only of the normal distribution but of 
several other important distributions as well. 

Definition. A continuous random variable X is said to have
the normal or Gaussian distribution with mean m and variance
$\sigma^2$ if

$$\mathrm{dens}(X=x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-m)^2/2\sigma^2}.$$

For brevity we will write simply "X is $N(m,\sigma^2)$". Some
authors write $N(m,\sigma)$ instead of $N(m,\sigma^2)$; one should beware.

Unlike most distributions, the formula for the normal
density comes with the mean and standard deviation already
specified. We should, however, verify that m and $\sigma^2$ really
are the mean and variance. In fact it isn't obvious that this
formula actually defines the density of a continuous random
variable. To verify these facts we use the following basic formula,
which everyone ought to have seen in a calculus course at
some time:

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.$$



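To prove this let $A = \int_{-\infty}^{\infty} e^{-x^2}\,dx$. Then, since x is a dummy
variable, $A = \int_{-\infty}^{\infty} e^{-y^2}\,dy$ also. Therefore

$$A^2 = \left(\int_{-\infty}^{\infty} e^{-x^2}\,dx\right)\left(\int_{-\infty}^{\infty} e^{-y^2}\,dy\right).$$

Now we switch to polar coordinates and integrate:

$$\begin{aligned}
A^2 &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy \\
&= \int_0^{2\pi}\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta \\
&= \int_0^{2\pi} \left[-\tfrac12\,e^{-r^2}\right]_0^{\infty} d\theta \\
&= \int_0^{2\pi} \tfrac12\,d\theta = \pi.
\end{aligned}$$

Hence $A = \sqrt{\pi}$.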
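The value of this integral is easy to confirm numerically. The sketch below uses a plain midpoint rule on [-10, 10]; the truncation point, the step count and the function name are our own choices, and the result is only a check on the polar-coordinate argument, not a substitute for it.

```python
from math import exp, pi, sqrt

def midpoint_integral(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

# The tails beyond |x| = 10 contribute less than e^(-100), so truncating is harmless.
A = midpoint_integral(lambda x: exp(-x * x), -10.0, 10.0)
print(A, sqrt(pi))
```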

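We use this formula first to show that the normal density
really defines a density. We first change variables to
$y = \frac{x-m}{\sigma\sqrt2}$ so that $\sigma\sqrt2\,dy = dx$. Then

$$\begin{aligned}
\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-m)^2/2\sigma^2}\,dx
&= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-y^2}\,\sigma\sqrt2\,dy \\
&= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-y^2}\,dy \\
&= 1.
\end{aligned}$$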



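Next we compute E(X) when X is $N(m,\sigma^2)$. Again we use the
change of coordinates $y = \frac{x-m}{\sigma\sqrt2}$. Note that $x = \sigma\sqrt2\,y + m$.

$$\begin{aligned}
E(X) &= \int_{-\infty}^{\infty} x\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-m)^2/2\sigma^2}\,dx \\
&= \int_{-\infty}^{\infty} (\sigma\sqrt2\,y + m)\,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-y^2}\,\sigma\sqrt2\,dy \\
&= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} (\sigma\sqrt2\,y + m)\, e^{-y^2}\,dy \\
&= \frac{\sigma\sqrt2}{\sqrt{\pi}} \int_{-\infty}^{\infty} y\,e^{-y^2}\,dy + \frac{m}{\sqrt{\pi}} \int_{-\infty}^{\infty} e^{-y^2}\,dy \\
&= \frac{\sigma\sqrt2}{\sqrt{\pi}} \left[-\tfrac12\,e^{-y^2}\right]_{-\infty}^{\infty} + \frac{m}{\sqrt{\pi}}\,\sqrt{\pi} \\
&= 0 + m.
\end{aligned}$$

Finally we leave it as an exercise to show that $\mathrm{Var}(X) = \sigma^2$.
It can be done using integration by parts.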
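For readers who want to check their work on this exercise, one possible route is sketched below. It reuses the substitution $y = (x-m)/(\sigma\sqrt2)$ from the computation of E(X) and then integrates by parts with $u = y$ and $dv = y\,e^{-y^2}\,dy$; this choice of parts is ours, and other routes work equally well.

```latex
\begin{aligned}
\mathrm{Var}(X) &= \int_{-\infty}^{\infty} (x-m)^2\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-m)^2/2\sigma^2}\,dx \\
&= \frac{2\sigma^2}{\sqrt{\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2}\,dy \\
&= \frac{2\sigma^2}{\sqrt{\pi}} \left( \left[-\tfrac12\,y\,e^{-y^2}\right]_{-\infty}^{\infty}
   + \tfrac12 \int_{-\infty}^{\infty} e^{-y^2}\,dy \right) \\
&= \frac{2\sigma^2}{\sqrt{\pi}} \left( 0 + \tfrac12\sqrt{\pi} \right) = \sigma^2.
\end{aligned}
```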

Although these computations look messy, we are inescapably
forced to consider this density function because
this is the distribution corresponding to the concept of
total randomness or complete randomness. In Chapter VII
we will make this concept more precise. So it is important
that one have an intuitive idea of what it means for a
random variable to be normal. We suggest that the following
properties of the normal distribution be memorized,
and one should familiarize oneself with the use of the tables
giving values of the normal distribution function.



[Figure: The standard normal density N(0,1).]

[Figure: The normal density $N(m,\sigma^2)$, $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-m)^2/2\sigma^2}$.]



The normal density function is symmetric about the
mean m and the maximum value is taken at the mean. Beyond
$3.5\sigma$ units from m, the value of the normal density is essentially
zero. The natural unit for measuring deviations
from the mean is the standard deviation. When x is the deviation
from the mean measured in this natural unit, then
we get the standard normal density. Most tables for the
normal distribution are tables of the standard normal density.




[Figure: Various normal densities with mean 0. The curve becomes steeper
and higher at the mean as $\sigma$ gets smaller.]



In all of the following X is standard normal, N(0,1).

The area within one standard deviation of the mean is 68.27%
of the total area, i.e. P(-1<X<1) = .6827.

The area within two standard deviations of the mean is 95.45%
of the total area, i.e. P(-2<X<2) = .9545.

The area within three standard deviations of the mean is 99.73%
of the total area, i.e. P(-3<X<3) = .9973.



In addition one should memorize the following two cases: 

The area within 1.96 standard deviations of the mean is 0.95 
The area within 2.58 standard deviations of the mean is 0.99 

These will be important when we compute significance levels. 

Occasionally one will see tables of the error function,
erf(t). This function is closely related to the normal distribution
although it is not the same:

$$\mathrm{erf}(t) = P(|Y|<t) = \frac{1}{\sqrt{\pi}} \int_{-t}^{t} e^{-y^2}\,dy,$$

where Y has distribution N(0,1/2). If X has the standard
normal distribution, then

$$P(-x<X<x) = \mathrm{erf}\!\left(\frac{x}{\sqrt2}\right)$$

and

$$P(X<x) = \frac12 + \frac12\,\mathrm{erf}\!\left(\frac{x}{\sqrt2}\right).$$
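Where printed tables are not at hand, these two relations give a quick way to recover the areas quoted earlier in this section. The sketch below relies on the `erf` function of Python's standard `math` module; the two wrapper names are our own.

```python
from math import erf, sqrt

def prob_within(x):
    """P(-x < X < x) for a standard normal X."""
    return erf(x / sqrt(2))

def distribution_function(x):
    """P(X < x) for a standard normal X."""
    return 0.5 + 0.5 * erf(x / sqrt(2))

for x in (1, 2, 3, 1.96, 2.58):
    print(x, round(prob_within(x), 4))
```

The first three lines of output reproduce the one, two and three standard deviation areas, and the last two are the 0.95 and 0.99 cases, up to rounding.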

3. The Central Limit Theorem 

The traditional explanation for the importance of the 
normal distribution relies on the Central Limit Theorem. 
Briefly, this theorem states that the average of n independent 
equidistributed random variables tends to the normal distri- 
bution no matter how the individual random variables are dis- 
tributed. The explanation for the ubiquity of the normal 
distribution then goes as follows. Suppose that X is the 
random variable representing the measurement of a definite 
quantity but which is subject to chance errors. The various 
possible imperfections (minute air currents, stray magnetic 
fields, etc.) are supposed to act like independent equi- 
distributed random variables whose sum is the total error 
of the measurement X. Unfortunately this explanation fails 
to be very convincing because there is no reason to suppose 
that the various contributions to the total error are either 
independent or equidistributed. We will have to wait until 
Chapter VII to find a more fundamental reason for the 
appearance of the normal distribution. The explanation
given there uses the concepts of entropy and information.
Intuitively, the sum of independent, equidistributed random
variables is progressively more disordered as we add more
and more of them. As a result the standardization of the sum
necessarily approaches having a normal distribution as $n \to \infty$.
This tendency to become disordered is exhibited even when the 
random variables are not quite independent and equidistributed. 
It is this tendency that accounts for the ubiquity of the 
normal distribution. 

Nevertheless the Central Limit Theorem is of importance 
in probability and statistics, particularly in the theory of 
hypothesis testing which we will be discussing in the next 
section. Moreover the proof of the Central Limit Theorem is 
more difficult than our intuitive justification would lead 
us to believe. We will now give a precise statement of this 
theorem. The proof is sketched in section 6. 

Suppose that $X_1, X_2, \dots, X_n, \dots$ are independent equidistributed
random variables whose common mean and variance are
$m = E(X_i)$ and $\sigma^2 = \mathrm{Var}(X_i)$. Let $S_n$ be the sum $X_1 + \dots + X_n$.
Then the mean of $S_n$ is $E(S_n) = E(X_1) + E(X_2) + \dots + E(X_n) = nm$,
and its variance is $\mathrm{Var}(S_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \dots + \mathrm{Var}(X_n) = n\sigma^2$
since the $X_i$'s are independent and equidistributed. Therefore
the standard deviation of $S_n$ is $\sigma(S_n) = \sqrt{\mathrm{Var}(S_n)} = \sqrt{n}\,\sigma$.
Hence the standardization of $S_n$ is

$$Y_n = \frac{S_n - nm}{\sigma\sqrt{n}} = \frac{X_1 + X_2 + \dots + X_n - nm}{\sigma\sqrt{n}}.$$

The Standardization of a Sum of Independent, Equidistributed
Random Variables whose common mean and variance
are m and $\sigma^2$ respectively.

This is an important formula to remember. The Central Limit
Theorem then says that $Y_n$ tends toward the normal distribution
as $n \to \infty$.

Central Limit Theorem. If $X_1, X_2, \dots, X_n, \dots$ are independent
equidistributed random variables with mean m and variance
$\sigma^2$, then

$$P(Y_n < t) = P\!\left(\frac{X_1 + X_2 + \dots + X_n - nm}{\sigma\sqrt{n}} < t\right) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx$$

as $n \to \infty$.



For example, in the Bernoulli process the random
variable $S_n$ is the sum of independent equidistributed random
variables $X_i$ whose common mean is m = p and whose common
variance is $\sigma^2 = pq$. Then

$$Y_n = \frac{S_n - np}{\sqrt{npq}}$$

tends toward the standard normal distribution. That is, $S_n$ is approximately
distributed according to N(np,npq). This approximation is
surprisingly accurate even for small values of n.




[Figure: The distribution of $S_4$ in the Bernoulli process using a fair coin.
Superimposed is the normal distribution N(2,1).]

For example, we know that $P(1 \le S_4 \le 3) = 1 - \frac{2}{16} = 0.875$. On
the other hand, when X is N(2,1), P(0.5 < X < 3.5) = 0.8664
(since 0.5 and 3.5 are each $1.5\sigma$ from the mean 2). You can see
that the fit is quite close.
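The comparison above is quick to reproduce. The sketch below computes the exact binomial probability for four tosses of a fair coin and the corresponding normal area, using the widened interval 0.5 to 3.5 exactly as in the text (this half-unit widening appears to be the usual continuity correction). The function names are ours.

```python
from math import comb, erf, sqrt

def binomial_prob(n, p, lo, hi):
    """P(lo <= S_n <= hi) in the Bernoulli process with bias p."""
    q = 1 - p
    return sum(comb(n, k) * p ** k * q ** (n - k) for k in range(lo, hi + 1))

def normal_prob(m, var, lo, hi):
    """P(lo < X < hi) when X is N(m, var)."""
    s = sqrt(2 * var)
    return 0.5 * (erf((hi - m) / s) - erf((lo - m) / s))

exact = binomial_prob(4, 0.5, 1, 3)
approx = normal_prob(2, 1, 0.5, 3.5)
print(exact, round(approx, 4))   # 0.875 0.8664
```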

The two most common manifestations of the Central Limit 
Theorem are the following: 

(1) As $n \to \infty$, the sum $S_n$ "tends" to the distribution $N(nm, n\sigma^2)$.

(2) As $n \to \infty$, the sample average $\bar m = S_n/n$ "tends" to the
distribution $N(m, \sigma^2/n)$.

The expectation of the sample mean is the mean as we already
noticed:

$$E(\bar m) = E\!\left(\frac{S_n}{n}\right) = \frac{nm}{n} = m,$$

hence the sample mean is an approximation to the true
mean m. The spread of the sample mean depends on the
variance

$$\mathrm{Var}(\bar m) = \mathrm{Var}(S_n/n) = \mathrm{Var}(S_n)/n^2 = n\sigma^2/n^2 = \frac{\sigma^2}{n}.$$

Intuitively it is clear that as $n \to \infty$, the sample average will
be a better and better approximation to the mean m. The
Central Limit Theorem tells us precisely how good an approximation
it is. In the following drawings we assume m = 0.



[Figure: As $n \to \infty$, the distribution of $S_n$ tends to broaden. The
spread of $S_n$ is proportional to $\sqrt{n}$.]

We remark that the independence of the random variables
is essential in the Central Limit Theorem. For example, the
gaps of the Uniform process are equidistributed, but their
sum $L_1 + L_2 + \dots + L_{n+1}$ is the length of the interval, which we
know with certainty.

Statistical Measurements 

Suppose we make n measurements $X_1, \dots, X_n$ of the same
quantity. Implicitly we are assuming that these measurements
are equidistributed and independent random variables. Each
measurement has a distribution whose mean is the quantity we
wish to measure. But the measurements are imperfect and so
tend to be spread to a certain extent on both sides of the mean.
Statisticians refer to this situation as a "random sample."

Definition. A random sample of size n is a set of n independent,
equidistributed random variables $X_1, \dots, X_n$.

In the next two sections we will consider the problem
of measuring the mean of a distribution using random samples.
In particular we would like to know how small a random sample
is sufficient for a given measurement. If we wish to determine
the average number of cigarettes smoked per day by Americans,
it would be highly impractical to ask every American for this
information. Statistics enables one to make accurate measurements
based on surprisingly small samples.

In addition to the measurement problem, we will also 
consider the problem of using a random sample as a means for 
making predictions of the future. The prediction will, of 
course, be a probabilistic one: with a certain probability 
the next measurement will lie within a certain range. For all
the statistical problems we will study, we will assume only
that the variance of each measurement $X_i$ is finite. In most
cases this is a reasonable assumption especially if the measure- 
ments lie in a finite interval. For example, the number of 
cigarettes smoked by one individual in one day is necessarily 



between and 10^. 



The general procedure can be summed up in the following
rule.

Main Rule of Statistics. In any statistical measurement we
may assume that the individual measurements are distributed
according to the normal distribution $N(m, \sigma^2)$.

To use this rule we first find the mean m and variance
$\sigma^2$ from information given in our problem or by using the
sample mean $\bar m$ and/or sample variance $\bar\sigma^2$ defined below. We
then standardize the random variables required in the problem.
Finally we use tables of the standard normal distribution to
solve the problem. We will see many examples of this basic
procedure. As stated, the main rule
says only that our results will be "reasonable" if we assume
that the measurements are normally distributed. We can
actually assert more. In the absence of better information,
we must assume that a measurement is normally distributed.
In other words if several models are possible, we must use
the normal model unless there is a significant reason for
rejecting it.

When the mean m and/or the variance $\sigma^2$ of the measurements
$X_i$ are not known, the following random variables may be used
as approximations.

The sample mean $\bar m = (X_1 + X_2 + \dots + X_n)/n$ approximates m.

The sample variance

$$\bar\sigma^2 = \frac{(X_1 - \bar m)^2 + (X_2 - \bar m)^2 + \dots + (X_n - \bar m)^2}{n-1}$$

approximates the variance $\sigma^2$.
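In code the two estimates are one-liners. The sketch below computes them for a short list of made-up scores and compares the result with `statistics.variance` from Python's standard library, which also divides by n - 1. The data and the function names are purely illustrative.

```python
import statistics

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

scores = [72, 85, 91, 64, 88, 79, 95, 70]   # made-up data for illustration
print(sample_mean(scores), sample_variance(scores))
assert abs(sample_variance(scores) - statistics.variance(scores)) < 1e-9
```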

For example an exam graded on a scale of 0-100 is 
given in a class of 100 students. The sample mean is found 
to be 81 with sample variance 100 (standard deviation 10) . 
Based on this data, we can predict that if the exam is given 
to another student, the student will score between 61 and 100
with probability 0.95 (within $2\sigma$ of $\bar m$). In actual exam situations,
the distribution of an individual exam score is more
complicated than the normal distribution, but in the absence 
of any better information we follow the Main Rule. 



When the mean m is known but the variance is not, there is a
slightly better approximation to the variance:
The sample variance (when m is known)

$$\bar\sigma^2 = \frac{(X_1 - m)^2 + (X_2 - m)^2 + \dots + (X_n - m)^2}{n}$$

also approximates the variance $\sigma^2$.
The reason for the different denominators in the two expressions is
subtle. We leave it as an exercise to show that the expectations
of the random variables are

$$E(\bar m) = m \quad\text{and}\quad E(\bar\sigma^2) = \sigma^2,$$

where the second equation holds for either sample variance. The
distributions of the random variables $\bar m$ and $\bar\sigma^2$ are very important in
statistics and we will undertake to compute some cases, leaving
the rest as exercises.
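The claim about the two denominators can be tested exactly on a small case before attempting the exercise. The sketch below averages each sample variance over every equally likely sample of n rolls of a fair die, whose true mean 7/2 and variance 35/12 are known. The die and the function names are our own choices, and the check covers only n = 2, 3, 4.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)
M = Fraction(7, 2)           # true mean of a fair die
VAR = Fraction(35, 12)       # true variance of a fair die

def expected_value(statistic, n):
    """Average of statistic(sample) over all 6**n equally likely samples."""
    samples = list(product(FACES, repeat=n))
    return sum(Fraction(statistic(s)) for s in samples) / len(samples)

def var_unknown_mean(s):
    """Deviations from the sample mean, denominator n - 1."""
    mbar = Fraction(sum(s), len(s))
    return sum((x - mbar) ** 2 for x in s) / (len(s) - 1)

def var_known_mean(s):
    """Deviations from the true mean, denominator n."""
    return sum((x - M) ** 2 for x in s) / len(s)

for n in (2, 3, 4):
    assert expected_value(var_unknown_mean, n) == VAR
    assert expected_value(var_known_mean, n) == VAR
```

Replacing the denominator n - 1 by n in `var_unknown_mean` makes the first assertion fail, which is one way to see why the two expressions must differ.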

4. Significance Levels

Let us begin with an example. We are presented with a 
coin having an unknown bias p. We are told that the coin is 
fair, but we are suspicious and would like to check this as- 
sertion. So we start tossing the coin. After 100 tosses we 
get only 41 heads. Do we have reason to suspect that the
coin is not fair? 

In such an experiment, we carefully examine the model we 
have postulated in order to determine what kind of behavior 
is consistent with the model. If the observed behavior is
consistent with the model we have no reason to suppose that
the coin is unfair. In this case the postulated model is the 
Bernoulli process with bias p = 1/2. 

The average value of $S_{100}$ is 50 for our postulated
model. We are interested in the possible deviation of $S_{100}$
from its mean value 50, because very large deviations are
unlikely in the model but would not be if the coin is unfair.
The usual statistical procedure in such a case is to determine
precisely how large a deviation from the mean is
reasonable in the model. Since $S_{100}$ has the binomial distribution,
we could in principle do this using only the formula
for this distribution. However the computation is extremely
difficult. On the other hand, we know that $S_{100}$ is
very close to having the normal distribution N(50,25). In
other words, $(S_{100}-50)/5$ has approximately the standard
normal distribution.

We now look in a table of the standard normal distribution.
There we find that

$$P\!\left(-1.96 < \frac{S_{100}-50}{5} < 1.96\right) = 0.95,$$

or

$$P(40.2 < S_{100} < 59.8) = 0.95.$$

Since 41 falls in this range, we conclude that our suspicions 
about the unfairness of the coin are groundless. The difference
1 - 0.95 = 0.05 is called the significance level of our
test. We then say "the experiment has no significance at the 
0.05 level." Notice that we say no significance. Statis- 
tically speaking, a significant result occurs only when a 
postulated model is rejected. 
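The arithmetic of this test is short enough to script. The sketch below recomputes the range 40.2 to 59.8 from the normal approximation and, since a computer makes the "extremely difficult" binomial sum routine, also evaluates the exact probability that a fair coin gives between 41 and 59 heads. The function names are ours, and the exact figure is offered only as a cross-check on the approximation.

```python
from math import comb, sqrt

def acceptance_interval(n, p, z):
    """Range m - z*sigma .. m + z*sigma for S_n under the normal approximation."""
    m, sigma = n * p, sqrt(n * p * (1 - p))
    return m - z * sigma, m + z * sigma

def exact_prob_between(n, p, lo, hi):
    """Exact binomial P(lo <= S_n <= hi)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(lo, hi + 1))

lo, hi = acceptance_interval(100, 0.5, 1.96)
print(round(lo, 1), round(hi, 1))         # 40.2 59.8
print(lo < 41 < hi)                       # True
print(exact_prob_between(100, 0.5, 41, 59))
```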



Looking at the reasoning a bit more carefully, we have 
said the following. Assuming that the coin is fair, about 
95% of the time we will get between 40 and 60 heads when we 
toss the coin 100 times. But 5% of the time we will not be 
within this range. 



The significance level represents the 
probability that we will reject the 
postulated model even though this 
model is correct. 

Notice the indirectness of this kind of reasoning. We say
nothing about whether or not the coin is really fair or unfair,
or even that it is fair or unfair with a certain probability.
Statistics never tells one anything for certain, even in the
weak sense of probabilistic certainty. All we can do is devise
tests for determining at some significance level whether or not
the data we have collected are consistent with the model.
Because of the abbreviated terminology that statisticians and
scientists frequently use when discussing the result of an
experiment, one should be careful not to ascribe to statistical
statements properties which they do not possess.

The 0.05 significance level is so commonly used by
statisticians and scientists that this level is assumed when
no significance level is specified. The 0.01 significance
level is also common, and an experiment is said to be very
significant if this level is being used. For example, in
our coin tossing test we found that getting 41 heads was
not (statistically) significant. On the other hand, getting
39 heads would be significant but would not be very significant,
while getting 35 heads would be very significant.
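
The classification of 41, 39 and 35 heads can be checked with a short calculation. The following is only a sketch under the normal approximation N(np, npq) used in the text; the function names `z_score` and `classify` are illustrative and do not come from the book.

```python
from math import sqrt
from statistics import NormalDist


def z_score(successes, n, p):
    """Standardized deviation of S_n from its mean under N(np, npq)."""
    return (successes - n * p) / sqrt(n * p * (1 - p))


def classify(successes, n, p):
    """Label a two-sided result using the 0.05 and 0.01 significance levels."""
    z = abs(z_score(successes, n, p))
    z05 = NormalDist().inv_cdf(0.975)  # about 1.96
    z01 = NormalDist().inv_cdf(0.995)  # about 2.58
    if z > z01:
        return "very significant"
    if z > z05:
        return "significant"
    return "not significant"


for heads in (41, 39, 35):
    print(heads, round(z_score(heads, 100, 0.5), 2), classify(heads, 100, 0.5))
```

The standardized deviations are -1.8, -2.2 and -3.0, which fall on the expected sides of the cutoffs 1.96 and 2.58.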

It is important to point out that the choice of a
significance level is part of the design of one's experiment.
It cannot be "calculated" after the data are collected. Doing
so is intellectual and scientific dishonesty of the worst
kind, for if one does this consistently it violates the
whole statistical framework within which the scientific
community works. Generally speaking, the choice of a
significance level is determined by considerations having
nothing to do with probability or statistics. For example, if
one is testing to see if a certain commonly used chemical could
be a cause of a disease, one would certainly want a very
significant result before recommending that the chemical be
banned, with all the political and economic repercussions
that such a decision could have.




Let us consider another example. We are given a die,
and we wish to test whether it is loaded. We decide to
consider whether "3" is special, and we choose to work at
the 0.05 significance level. Our experiment consists of
rolling the die 120 times, and we find that "3" comes up 25
times. Our postulated model is now the Bernoulli process
with bias p = 1/6. The mean and variance of a single roll
are $m = p = 1/6$ and $\sigma^2 = pq = 5/36$. Therefore the
number of threes, $S_{120}$, is approximately
$N(120/6,\ 120\cdot 5/36) = N(20,\ 100/6)$, and hence
$(S_{120}-20)\sqrt{6}/10$ is approximately N(0,1). Our
experiment is significant at the 0.05 level only if
$|(S_{120}-20)\sqrt{6}/10| > 1.96$. In our case $S_{120} = 25$,
so $|(S_{120}-20)\sqrt{6}/10| = |5\sqrt{6}/10| = \sqrt{3/2}$,
which is about 1.22. This is not larger than 1.96. Therefore
the experiment is not significant, and we have no reason to
suspect that the die is loaded.
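
The arithmetic of the die example can be reproduced in a few lines. This is a sketch only; the variable names are illustrative and the cutoff 1.96 is the table value quoted in the text.

```python
from math import sqrt

n, p, observed = 120, 1 / 6, 25   # rolls, postulated bias, number of threes

mean = n * p                      # 20
variance = n * p * (1 - p)        # 100/6
z = (observed - mean) / sqrt(variance)

print(round(mean, 6), round(variance, 4), round(z, 4))
print("significant at 0.05:", abs(z) > 1.96)
```

The standardized value is sqrt(3/2), well inside the range from -1.96 to 1.96.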

Rule of Thumb

A quick rule of thumb for testing the Bernoulli process
(to be used only if one is in a hurry) is the following. If
one tosses n times a coin with bias p, then the result is

    significant       if the number of heads lies outside $np \pm 2\sqrt{npq}$
    very significant  if the number of heads lies outside $np \pm \ldots$

5. Confidence Intervals

The concept of a confidence interval is a variation
on the statistical themes we have just been describing.
Instead of testing a hypothesis, one is interested in accuracy
of a measurement or in prediction of the future.

Let us consider a very simple example of the prediction
of the future. Suppose we have two competing airlines on a
given route, both having the same departure time. Suppose
that every day exactly 1000 passengers show up and that each
one chooses one or the other airline with probability 1/2,
independently of the other passengers. Both airlines want
to be able to accommodate as many passengers as possible.
They could do this, of course, by providing 1000 available
seats. Needless to say this would be disastrously expensive,
particularly since the probability that all 1000 seats would
ever be needed is essentially zero. By providing 1000 seats
we would have absolute certainty that there will never be an
overflow; if we are willing to accept a 5% chance of an
overflow, the number of seats we must provide decreases
dramatically.

To compute this we again use the normal approximation of
the binomial distribution. The model we are using is the
Bernoulli process with bias p = 1/2. The variance of a
single toss is 1/4. The number of passengers choosing one
particular airline is $S_{1000}$, which has approximately the
distribution N(500,250). Hence $(S_{1000}-500)/(5\sqrt{10})$ is
almost N(0,1). We now look up in a table of the standard normal
distribution that number t for which

$$P(Y \le t) = 0.95 .$$

We find that t = 1.645. This tells us that

$$P\left(\frac{S_{1000}-500}{5\sqrt{10}} \le 1.645\right) = 0.95$$

or $P(S_{1000} \le 526) = 0.95$.

We need only provide 526 seats to have 95% confidence of not
having an overflow. This is quite a dramatic drop from 1000
seats. Even for 99% confidence we need only a few more seats:

$$P\left(\frac{S_{1000}-500}{5\sqrt{10}} \le 2.33\right) = 0.99$$

or $P(S_{1000} \le 537) = 0.99$.
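
The seat counts above can be recomputed from the normal quantiles instead of a table. This is a minimal sketch assuming the same normal approximation; `seat_cutoff` is a hypothetical helper name, and rounding to the nearest integer mirrors the rounding used in the text.

```python
from math import sqrt
from statistics import NormalDist


def seat_cutoff(passengers, p, confidence):
    """Value c with P(S <= c) = confidence when S is approximately N(np, npq)."""
    mean = passengers * p
    sd = sqrt(passengers * p * (1 - p))
    t = NormalDist().inv_cdf(confidence)  # 1.645 for 0.95, 2.33 for 0.99
    return mean + t * sd


print(round(seat_cutoff(1000, 0.5, 0.95)))  # 526
print(round(seat_cutoff(1000, 0.5, 0.99)))  # 537
```

A more cautious airline would round the cutoff up, not to the nearest integer, which adds at most one seat.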
We speak of the interval [0,526] as being a 95%
confidence interval for $S_{1000}$. In general any interval
[a,b] for which $P(a \le X \le b) = 0.95$ is called a 95%
confidence interval for the random variable X. When X is
normally distributed (or approximately normally distributed)
with distribution $N(m,\sigma^2)$, we generally use either a
one-sided confidence interval or a two-sided confidence
interval. A one-sided interval is of the form $(-\infty,t]$ or
of the form $[t,\infty)$. A two-sided interval is chosen to be
symmetric about the mean: $[m-t,m+t]$. When testing statistical
hypotheses, one uses either a one-sided or a two-sided
confidence interval. The corresponding tests are then referred
to as a single-tail or a double-tail test respectively.

Now we consider the problem of the accuracy of statistical
measurements. Suppose we wish to determine the percentage of
adult Americans who smoke. To find out this number we
randomly sample n persons. How many persons do we have to
sample in order to determine the percentage of smokers to
two decimal place accuracy? Of course, we can determine this
percentage to this accuracy with absolute certainty only by
asking virtually the whole population, because there is
always the chance that those not asked will all be smokers.
Therefore we must choose a confidence level. The usual
level is 95% so we will use this.

The model we are using is the Bernoulli process with
bias p, where p is the percentage we are trying to compute.
Each person we ask will be a smoker with probability p. If
we randomly ask n persons, the number who smoke divided by
n will be an approximation $\hat{p}$ to p. This number is the
sample mean $\bar{m} = S_n/n$. We use the Central Limit Theorem
in its second manifestation. We find that $\hat{p} = \bar{m}$
has approximately the distribution $N(p,\sigma^2/n)$, where
$\sigma^2 = pq$. Therefore $(\bar{m}-p)/(\sigma/\sqrt{n})$ is
approximately N(0,1). We require a two-sided interval in this
problem:

$$P\left(-1.96 \le \frac{\bar{m}-p}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

or

$$P\left(|\bar{m}-p| \le 1.96\,\sigma/\sqrt{n}\right) = 0.95 .$$

We want to choose n so that $|\bar{m}-p| \le 0.005$ in order to
have two place accuracy. That is, $1.96\,\sigma/\sqrt{n} \le 0.005$,
or $n \ge (1.54\times 10^5)\,\sigma^2$. Unfortunately to compute
$\sigma^2 = pq$ we must know p. However we know that $\sigma^2$
takes its largest value when p = q = 1/2. Therefore the required
sample size satisfies $n \le (1.54\times 10^5)(0.25) = 3.85\times 10^4$.
In other words, to determine the percentage of smokers with 95%
confidence we must sample up to 38,500 persons.

In practice one would first determine p to one decimal place
accuracy. This requires only a sample of 385 persons. Using
this number, one can compute $\sigma$ more precisely. Using this
better value of $\sigma$ we can determine more precisely how many
persons must be sampled in order to find p to two decimal
place accuracy. For example suppose that with the smaller
sample we find that $p = 0.65 \pm 0.05$. The worst case for
$\sigma^2$ is now pq = (0.6)(0.4) = 0.24. We must then sample
n = 37,000 persons to determine p to within 0.005.
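
The sample-size rule $n \ge (z/\text{accuracy})^2 \sigma^2$ used above is easy to evaluate directly. The sketch below assumes the normal approximation and uses exact normal quantiles; `sample_size` is an illustrative name. Because the text rounds $(1.96/0.005)^2$ up to $1.54\times 10^5$, its figures (38,500 and 37,000) are slightly larger than the ones this code produces.

```python
from math import ceil
from statistics import NormalDist


def sample_size(accuracy, confidence, variance=0.25):
    """Smallest n with z * sqrt(variance / n) <= accuracy (two-sided).

    variance is pq for the Bernoulli process; 0.25 is the worst case p = 1/2.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    return ceil((z / accuracy) ** 2 * variance)


print(sample_size(0.05, 0.95))          # one decimal place, worst case: 385
print(sample_size(0.005, 0.95))         # two decimal places, worst case
print(sample_size(0.005, 0.95, 0.24))   # two decimal places, pq = 0.24
```

Tightening the accuracy by a factor of 10 multiplies n by 100, while moving from 95% to 99% confidence changes only the factor z squared.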

One must be careful not to confuse the accuracy with
the confidence. The accuracy tells us how accurately we think
we have measured a certain quantity. The confidence tells us
the probability that we are right. To illustrate the
distinction between these two concepts we consider the above
measurement problem with two accuracies and three confidence
levels. In general, improving confidence does not require
much more effort while increasing accuracy requires a great
deal of additional effort.

                                   Accuracy
    Confidence   0.05 (one decimal place)   0.005 (two decimal places)
    95%                     385                       38,500
    99%                     667                       66,700
    99.9%                 1,089                      108,900

The number of individuals that must be sampled to determine
the percentage having a certain property (in the worst case
p = 1/2)

In the exercises we consider more examples of significance
levels and confidence intervals. Some of these have a distinct
air of the supernatural about them. How for example is it
possible to make conclusions about the television preferences of
a population of 200 million persons based on a sample of only
400 of them? In fact the size of the population is irrelevant
to the statistical analysis. (It arises only when one
confronts the problem of making a random sample from a very large
population. This is a very difficult problem for statisticians.)



6. The Proof of the Central Limit Theorem 

If you are familiar with the concept of the Fourier transform,
the proof of the Central Limit Theorem is not very difficult to
understand. We will sketch the proof leaving the details as an
exercise.

Let $X_1, X_2, \ldots$ be a sequence of independent
equidistributed random variables having finite variance.
Without loss of generality, we may assume that they are
standard. Set $S_n = X_1 + X_2 + \cdots + X_n$. We wish to show
that the distribution of $S_n/\sqrt{n}$ tends to the standard
normal distribution.

Recall that the Laplace transform of a function f(x) is
defined to be the function
$\varphi(\lambda) = \int_0^{\infty} e^{-\lambda x} f(x)\,dx$,
defined for $\lambda \ge 0$ when f(x) is the density of a random
variable. If we replace the nonnegative parameter $\lambda$ by a
purely imaginary one, $i\zeta$, for $-\infty < \zeta < \infty$,
we obtain a transform known as the Fourier transform:

$$\psi(\zeta) = \int_{-\infty}^{\infty} e^{i\zeta x} f(x)\,dx,$$

defined for all real numbers $\zeta$. By deMoivre's theorem,
$e^{i\zeta x} = \cos(\zeta x) + i \sin(\zeta x)$, so that
$\psi(\zeta)$ may be written as

$$\int_{-\infty}^{\infty} \cos(\zeta x) f(x)\,dx + i \int_{-\infty}^{\infty} \sin(\zeta x) f(x)\,dx,$$

where each of the two integrals is real. When f(x) is the
density of a random variable X, we say $\psi(\zeta)$ is the
characteristic function of X.

We begin by calculating some values of the characteristic
function $\psi(\zeta)$. At zero we get

$$\psi(0) = \int_{-\infty}^{\infty} e^{0} f(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx = 1,$$

since f(x) is a density. Similarly, the values of the derivatives
of $\psi(\zeta)$ at zero can be computed by "differentiation under
the integral sign":

$$\psi^{(n)}(\zeta) = \int_{-\infty}^{\infty} \frac{d^n}{d\zeta^n}\, e^{i\zeta x} f(x)\,dx
  = \int_{-\infty}^{\infty} (ix)^n e^{i\zeta x} f(x)\,dx,$$

so that

$$\psi^{(n)}(0) = i^n \int_{-\infty}^{\infty} x^n f(x)\,dx = i^n E(X^n).$$

In other words, $\psi^{(n)}(0)$ is $i^n$ times the n-th moment of X.

If we assume that X has finite variance, then E(X) and
$E(X^2)$ exist, and we may apply the Taylor expansion theorem to
conclude that

$$\psi(\zeta) = \psi(0) + \psi'(0)\,\zeta + \tfrac{1}{2}\psi''(0)\,\zeta^2 + o(\zeta^2), \quad \text{as } \zeta \to 0.$$

If X is standard, then

$$\psi(\zeta) = 1 - \tfrac{1}{2}\zeta^2 + o(\zeta^2), \quad \text{as } \zeta \to 0.$$



The Fourier transform satisfies the same convolution property
as the Laplace transform:

Fact. If $\psi_X(\zeta)$ and $\psi_Y(\zeta)$ are the characteristic
functions of random variables X and Y and if X and Y are
independent, then $\psi_{X+Y}(\zeta) = \psi_X(\zeta)\,\psi_Y(\zeta)$
is the characteristic function of X + Y.

Proof. We may write the characteristic function $\psi_X(\zeta)$ of
X as

$$\psi_X(\zeta) = \int_{-\infty}^{\infty} e^{i\zeta x} f(x)\,dx = E(e^{i\zeta X}).$$

By the multiplicative property of expectations of independent
R.V.'s,

$$\psi_{X+Y}(\zeta) = E(e^{i\zeta(X+Y)}) = E(e^{i\zeta X} e^{i\zeta Y})
  = E(e^{i\zeta X})\,E(e^{i\zeta Y}) = \psi_X(\zeta)\,\psi_Y(\zeta).$$

Therefore, if $X_1, X_2, \ldots$ is a sequence of independent
equidistributed standard random variables, having characteristic
function $\psi(\zeta)$, then their sum $S_n$ has characteristic
function $\psi(\zeta)^n$. By a change of variables, the random
variable $S_n/\sqrt{n}$ has characteristic function
$\psi(\zeta/\sqrt{n})^n$. Utilizing the Taylor expansion computed
above, we find that

$$\psi\left(\frac{\zeta}{\sqrt{n}}\right)^n
  = \left(1 - \frac{1}{2}\left(\frac{\zeta}{\sqrt{n}}\right)^2
    + o\left(\left(\frac{\zeta}{\sqrt{n}}\right)^2\right)\right)^n
  \quad \text{as } \frac{\zeta}{\sqrt{n}} \to 0$$

$$= \left(1 - \frac{\zeta^2}{2n} + o\left(\frac{1}{n}\right)\right)^n
  \quad \text{as } n \to \infty \text{ (with } \zeta \text{ fixed).}$$

Now we know from calculus that

$$\left(1 - \frac{\zeta^2/2}{n}\right)^n \to e^{-\zeta^2/2}, \quad \text{as } n \to \infty,$$

and we leave it as an exercise to show that this also works
when we have the extra $o(1/n)$ term. Therefore, the characteristic
function of the standardized sum $S_n/\sqrt{n}$ approaches
$e^{-\zeta^2/2}$ as $n \to \infty$, for every fixed $\zeta$.
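
The convergence just derived can be watched numerically in one concrete case. The example is an assumption chosen for illustration, not taken from the text: for a fair coin coded as X = +1 or -1 (a standard random variable), the characteristic function is $E(e^{i\zeta X}) = \cos\zeta$, so $S_n/\sqrt{n}$ has characteristic function $\cos(\zeta/\sqrt{n})^n$.

```python
from math import cos, exp, sqrt


def psi_standardized_sum(zeta, n):
    """Characteristic function of S_n / sqrt(n) for independent fair +/-1 steps."""
    return cos(zeta / sqrt(n)) ** n


zeta = 1.5
limit = exp(-zeta ** 2 / 2)  # claimed limit e^{-zeta^2 / 2}
for n in (1, 10, 100, 10_000):
    print(n, round(psi_standardized_sum(zeta, n), 6), round(limit, 6))
```

The printed values approach the limit as n grows; other fixed values of `zeta` behave the same way.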



We now suspect that $e^{-\zeta^2/2}$ is the characteristic function
of the standard normal distribution. This can be proved a number
of ways. One could first show that the convolution of normal
distributions is normal so that if $X_1, X_2, \ldots$ are all
standard normal then so is $S_n/\sqrt{n}$. It then follows by the
above result that the characteristic function of the standard
normal distribution must be $e^{-\zeta^2/2}$. One can also compute
this characteristic function directly by differentiating under the
integral sign and using an integration by parts. We leave this
as an exercise.

The Central Limit Theorem follows from the above calculation
and the following two properties of the characteristic function:

Property 1 (Fourier inversion). Different probability distributions
have different characteristic functions.

Indeed, if $\psi(\zeta)$ has the property that
$\int_{-\infty}^{\infty} |\psi(\zeta)|\,d\zeta < \infty$, then
one may use the Fourier inversion formula to compute f(x) in
terms of $\psi(\zeta)$:

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\zeta x}\, \psi(\zeta)\,d\zeta .$$

Property 2 (Continuity). If a sequence of characteristic functions
$\psi_1, \psi_2, \ldots$ converges to a characteristic function
$\psi$ in the sense that for all $\zeta$,

$$\lim_{n\to\infty} \psi_n(\zeta) = \psi(\zeta),$$

then the probability distributions corresponding to the
$\psi_n(\zeta)$ converge to that of $\psi(\zeta)$ as $n \to \infty$.

7. The Law of Large Numbers

The law of large numbers is the statement that is often
taken as justification of the definition of probability in
terms of frequency. For example, what does it mean to say
that the probability is 1/2 for getting a head when a fair
coin is tossed? In the frequentist point of view, one says
that this means the proportion of heads in a very large number
of tosses will be very close to 1/2. But this is really
begging the question in some sense as we will see.

Let $X_1, X_2, \ldots$ be independent equidistributed random
variables with common mean m and common variance
$\sigma^2 < \infty$. We would like to say that
$(X_1 + X_2 + \cdots + X_n)/n$ approaches m as $n \to \infty$.
But these are random variables so we can only speak of the
probability that the limit is m.

The Law of Large Numbers

$$P\left(\lim_{n\to\infty} \frac{X_1 + X_2 + \cdots + X_n}{n} = m\right) = 1.$$

This is essentially just a psychological theorem, for it
does not provide the information necessary for concrete
applications. The Central Limit Theorem is far more useful,
and in fact the law of large numbers is a consequence of
the Central Limit Theorem. We leave the proof as an exercise.

In any case the law of large numbers is a purely
mathematical theorem. In order for it to make sense we must
already have the concepts of probability, random variables,
means, variances, etc. We cannot use this as the definition
of probability. But we cannot even use the law of large
numbers as a justification of the frequentist point of view.
This point of view says that probabilities represent a
physically measurable quantity (at least in principle).
But there is no concept of a physical "measurement"
corresponding to the mathematical concept of the limit

$$\lim_{n\to\infty} \frac{X_1 + X_2 + \cdots + X_n}{n} .$$

The relationship between physical experiments and the theory
of probability is much more subtle than the frequentist
point of view would have one believe.

The law of large numbers is not very useful in applications
because it does not specify how large a sample is required to
achieve a given accuracy. However it does have interesting
theoretical applications. We will see one in section VII.2 (the
Shannon Coding Theorem). Another theorem which has great usefulness
in probability theory is the Bienaymé-Chebyshev Inequality. Its
importance stems primarily from its simplicity.

Bienaymé-Chebyshev Inequality. Let X be a random variable with
mean E(X) = m and variance $\mathrm{Var}(X) = \sigma^2$. Then for
all t > 0,

$$P(|X - m| \ge t) \le \sigma^2/t^2 .$$

Proof

Suppose that X is a continuous R.V. with density f(x). The
proof in the case of an integer R.V. is similar. Clearly we may
assume that m is zero, for if not we just replace X by X - m.
Then

$$\sigma^2 = \mathrm{Var}(X) = \int_{-\infty}^{\infty} x^2 f(x)\,dx
  \ge \int_{|x| \ge t} x^2 f(x)\,dx
  \ge t^2 \int_{|x| \ge t} f(x)\,dx = t^2\, P(|X| \ge t),$$

the last inequality being a consequence of the fact that
$x^2 \ge t^2$ in the domain of integration. If we now solve for
$P(|X| \ge t)$ we get the desired inequality.
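
To see how conservative the inequality is, one can compare it with an exact tail probability. The sketch below does this for the number of heads in 100 tosses of a fair coin, an example chosen here for illustration; `binomial_two_sided_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_two_sided_tail(n, p, t):
    """Exact P(|S_n - np| >= t) when S_n has the binomial distribution."""
    mean = n * p
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - mean) >= t
    )


n, p, t = 100, 0.5, 10
exact = binomial_two_sided_tail(n, p, t)
bound = n * p * (1 - p) / t ** 2  # sigma^2 / t^2 with sigma^2 = npq = 25

print(round(exact, 4), bound)
```

The bound 0.25 holds but is several times larger than the exact probability, which is why the inequality is valued for its simplicity and generality and not for its sharpness.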

The last result we will consider in this section is one of
the most astonishing facts about probability: the Kolmogorov
Zero-One law. As with the other theorems in this section it has
little practical usefulness, but it has many theoretical
applications. The law of large numbers, for example, can be proved
using it.

Suppose that $X_1, X_2, \ldots$ is a sequence of random variables
which are independent but not necessarily equidistributed.
A tail event A is an event such that

(1) A is defined in terms of the random variables $X_1, X_2, \ldots$

(2) A is independent of any finite set of the $X_i$'s, i.e.

$$P\big(A \mid (X_1 = t_1) \cap (X_2 = t_2) \cap \cdots \cap (X_n = t_n)\big) = P(A)$$

for any $n < \infty$ and any set of $t_i$'s.

Kolmogorov Zero-One Law

If A is a tail event, then P(A) = 0 or 1.



At first it seems that there cannot be any tail events
except for $\Omega$ and $\Omega^c$, because tail events seem both
to depend on the $X_i$'s and not to depend on the $X_i$'s. However
there are, in fact, many nontrivial examples. Here is one. Toss a
fair coin infinitely often, and write

    $X_n = +1$ if the n-th toss is heads,
    $X_n = -1$ if the n-th toss is tails.

Now let A be the event "$\sum_{n=1}^{\infty} X_n/n$ converges."
This is a tail event because the convergence or divergence of a
series is determined by the terms of the series but is independent
of any finite set of them. We all should know at least two
examples from calculus: $\sum_{n=1}^{\infty} 1/n$ diverges but
$\sum_{n=1}^{\infty} (-1)^{n+1}/n$ converges to $\ln 2$. What we
are doing is to change the signs of the harmonic series
$\sum_{n=1}^{\infty} 1/n$ randomly and independently. P(A)
is the probability that a random choice of signs yields a
convergent series. The zero-one law tells us that P(A) can
only be 0 or 1; there are no other possibilities. In fact
P(A) = 1; we leave this as an exercise.

As another example, suppose that a monkey is trained to
hit the keys of a typewriter and does so at random, each key
having a certain probability of being struck each time,
independently of all other times. Let A be the event "the
monkey eventually types out Shakespeare's Hamlet." Again
this is clearly a tail event and so P(A) = 0 or 1. This is
easy to see. Hamlet has about $2\times 10^5$ characters and could
be written with a typewriter having 100 keys. Suppose each key
has probability .01 of being typed. The probability of
typing Hamlet is $p = (.01)^{2\times 10^5}$ during a given
"session" of $2\times 10^5$ keystrokes. The probability of not
typing Hamlet in one session is q = 1 - p < 1. The probability
that in infinitely many sessions the monkey never types out
Hamlet is $\lim_{n\to\infty} q^n = 0$.

Therefore P(A) = 1. On the other hand, the expected waiting
time until the monkey types out Hamlet is about $10^{400,000}$
keystrokes. If the monkey could type one keystroke every
nanosecond, the expected waiting time until the monkey types
out Hamlet is so long that the estimated age of the universe is
insignificant by comparison.
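
The orders of magnitude in the monkey example are too extreme for ordinary floating point, but they can be tracked through logarithms. This is a rough sketch; the age of the universe used for comparison (about 13.8 billion years, roughly 4.35e17 seconds) is an assumption supplied here, not a figure from the text.

```python
from math import log10

keys = 100             # typewriter keys, each struck with probability 0.01
length = 2 * 10 ** 5   # approximate number of characters in Hamlet

# p = (0.01)^(2*10^5) underflows to 0.0 as a float, so work with log10(p).
log10_p = -length * log10(keys)

# A geometric waiting time with success probability p has mean 1/p sessions.
log10_expected_sessions = -log10_p

# Assumed age of the universe expressed in nanoseconds (about 4.35e26).
log10_universe_ns = log10(4.35e17 * 1e9)

print(log10_p)                      # -400000.0
print(round(log10_universe_ns, 1))  # 26.6
```

An exponent of 400,000 against an exponent of about 27 is the sense in which the age of the universe is insignificant by comparison.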

Needless to say this is not a practical method for writing
plays. The Kolmogorov zero-one law has little practical
usefulness. But it does have theoretical uses, and it shows how
counter-intuitive probability theory can be.

Normal Distribution Function

Values of $F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt = P(X \le x)$, where
X is normally distributed with mean 0 and variance 1

  x     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

 0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879

 0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389

 1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319

 1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767

 2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936

 2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986

 3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
8. Exercises for Chapter IV: Statistics and the Normal Distribution

Variance

1. Suppose that X is a random variable whose density is

$$\mathrm{dens}(X = x) = \begin{cases} (\beta-1)\,x^{-\beta} & \text{if } x \ge 1, \\ 0 & \text{if } x < 1, \end{cases}$$

where $\beta > 1$. Show:

(a) X has neither mean nor variance if $1 < \beta \le 2$.

(b) X has mean $\frac{\beta-1}{\beta-2}$ if $\beta > 2$, but has no
variance if $2 < \beta \le 3$.

(c) X has variance $\frac{\beta-1}{(\beta-2)^2(\beta-3)}$ if $\beta > 3$.

2. Let X be the random variable in the symmetric Bernoulli
process random walk model. Let $Y = X^2$. Then X and Y are
obviously dependent random variables. Show that Cov(X,Y) = 0.
Hence the converse to the Fact in section IV.1 is false.



3. Verify the following formula, which holds for arbitrary random
variables $X_1, X_2, \ldots, X_n$ (not necessarily independent) so
long as both sides exist:

$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \sum_{i} \mathrm{Var}(X_i) + 2 \sum_{i<j} \mathrm{Cov}(X_i, X_j).$$

Use this formula and the exchangeability of the gaps in the uniform
process to compute $\mathrm{Var}(X_{(k)})$.

4. Prove that if X and Y are independent continuous R.V.'s,
then E(XY) = E(X)E(Y). To do this one must split X into
positive R.V.'s $X^+$ and $X^-$ such that $X = X^+ - X^-$. For
example define $X^+$ by

$$X^+ = \begin{cases} X & \text{if } X \ge 0, \\ 0 & \text{if } X < 0. \end{cases}$$

Do the same for Y. Note also that
$\mathrm{dens}(XY = z \mid X = x) \ne \mathrm{dens}(Y = z/x \mid X = x)$.



5. Prove that for any two random variables X and Y,
$E(XY) \le \sqrt{E(X^2)\,E(Y^2)}$ (Schwarz Inequality). Use this to
prove that the correlation coefficient of X and Y satisfies
$|\rho(X,Y)| \le 1$. Also show that

$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma(X)\,\sigma(Y)} .$$



6. Let $X_1, X_2, \ldots, X_n$ be a set of independent random
variables not necessarily equidistributed but all having the same
mean m and variance $\sigma^2$. Prove that the sample mean has
expectation m and that both sample variances have expectation
$\sigma^2$. Also compute the variance of the sample mean. Can you
compute the variance of either sample variance?

Notice that the denominator must be different for the two
sample variances. This denominator is called the number of
degrees of freedom by statisticians. Intuitively, each estimation
of a parameter of the unknown distribution causes a loss of one
degree of freedom in the random sample.

7. In exercise III.10 we saw that if X and Y are independent
then g(X) and h(Y) are also independent for any two functions
g and h. It follows that g(X) and h(Y) are also uncorrelated.
Prove the converse: if g(X) and h(Y) are uncorrelated for every
pair of functions g and h, then X and Y are independent.

Normal Distribution 

8. Using the normal distribution table compute:

    (a) $P(-.5 < X < .5)$, where X is N(0,1)       [Answer: .383]
    (b) $P(X < -2)$, where X is N(0,1)             [.0228]
    (c) $P(Y > 5)$, where Y is N(0,4)              [.0062]
    (d) $P(1 \le Y \le 4)$, where Y is N(-2,9)     [.1359]

9. Find a number a such that

    (a) $P(X \ge a) = 0.03$, where X is N(0,1)               [1.88]
    (b) $P(-a \le X \le a) = 0.88$, where X is N(0,1)        [1.555]
    (c) $P(-2-a \le Y \le a-2) = 0.90$, where Y is N(-2,9)   [4.935]
    (d) $P(Y \ge a) = 0.98$, where Y is N(0,4)               [-4.126]

10. Show explicitly that the normal distribution $N(m,\sigma^2)$
really does have variance $\sigma^2$.
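
The bracketed answers to exercises 8 and 9 can be checked without a table. This is a sketch only: it takes the probabilities in 9(b) and 9(c) to be 0.88 and 0.90, the values consistent with the bracketed answers, and the variable names are illustrative. Note that `NormalDist` takes the standard deviation, not the variance.

```python
from statistics import NormalDist

Z = NormalDist()         # N(0, 1)
Y4 = NormalDist(0, 2)    # N(0, 4): variance 4, standard deviation 2
Y9 = NormalDist(-2, 3)   # N(-2, 9): variance 9, standard deviation 3

ex8 = [
    Z.cdf(0.5) - Z.cdf(-0.5),  # 8(a)
    Z.cdf(-2),                 # 8(b)
    1 - Y4.cdf(5),             # 8(c)
    Y9.cdf(4) - Y9.cdf(1),     # 8(d)
]
ex9 = [
    Z.inv_cdf(0.97),           # 9(a): P(X >= a) = 0.03
    Z.inv_cdf(0.94),           # 9(b): P(-a <= X <= a) = 0.88
    3 * Z.inv_cdf(0.95),       # 9(c): P(|Y + 2| <= a) = 0.90
    Y4.inv_cdf(0.02),          # 9(d): P(Y >= a) = 0.98
]

print([round(v, 4) for v in ex8])
print([round(v, 3) for v in ex9])
```

Small differences from the bracketed answers in the last decimal place come from interpolating in a four-place table.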

Significance Levels 

11. A prestigious scientific journal announces as part of its
editorial policy that only results significant at the .01 level
will be acceptable for publication (and conversely any result
significant at the .01 level is acceptable). They reason that by
doing so their readership will have the confidence that at most
1% of the published results will be incorrect. Discuss the fallacy
of this policy. [Hearing about this new policy, 1000 conscientious
experimenters formulate 1000 wrong scientific hypotheses. On the
average 10 of them would find a significant result, and these 10
would then be entitled to publish their results. Let's say that
these 10 articles constitute the first issue of the journal after
the new policy is instituted. We would find that the journal
policy allowed 100% of the published results to be wrong. Clearly
the journal policy is a result of a misunderstanding of the nature
of statistical hypothesis testing: significant at the .01 level
does not mean that there is only a 1% chance that one is wrong.]

12. In a scientific paper you read the following: "In four of
our experiments the data are significant at the .05 level. The
fifth experiment, however, is significant at the .01 level!"
What is misleading about this?

13. A population scientist believes that roughly 50% of the
population is female, but doesn't want to be too hasty. So he
decides to be cautious and to test whether or not at least 45%
of the population is female. To do this he takes a random
sample of 100 persons. If he discovers that only 40 of them are
female, does he have sufficient evidence to reject the model that
(at least) 45% of the population is female? Use a Bernoulli
process with p = .45 and a one-sided significance test.

14. A statistician wonders just how careful the scientist in
exercise 13 was when he made his random sample. Is 40 significantly
different from the expected value of 50? Is 40 significantly
smaller than 50? What do you think of the sampling technique of
the scientist? [Answer: Yes; yes; not much.]

15. You own a company that produces medium quality left-handed
screws. About 1% of the screws produced by one machine are
defective. As the screws are produced, the defective ones are
found and discarded. A count is kept of the number of defective
screws produced each hour. The machine is readjusted whenever the
number of defective screws produced is significantly greater than
1%. You may regard this as a Bernoulli process.
The machine makes 10,000 screws per hour. Describe a procedure
for determining when the machine is out of adjustment at the 0.05
level and at the 0.01 level.

16. A congressman wishes to vote according to the "will of the
people" on a certain bill. Now in this case one wishes to know
whether the percentage p of his constituency in favor of the
bill is above or below 50%. Clearly if p is close to 1/2 a
rather large sample will be required to distinguish between the
two possibilities. How is this reflected in a statistical test?
For example suppose that a poll is made soliciting the opinion of
a certain number of voters chosen at random. Use a .05
significance level to decide what the congressman should do in
each of the following cases.

    Number of pollees    Number of pollees in favor of the bill
          100                          54
          100                          41
          500                         267
          500                         225
         1000                         534

Note that the congressman has three choices in each case: (a)
vote for the bill, (b) vote against the bill or (c) order a
larger sample be taken.

17. A company wishes to test the effectiveness of a new magazine
advertising campaign. It decides that the campaign is effective
if the proportion of subscribers to the magazine who use their
product is twice as large as the proportion of non-subscribers who
do so. A 10% significance level is agreed upon. It is known that
15% of the general population use the firm's product. A sample
of 50 subscribers is ordered and it is found that 10 use the
product. What does this suggest about the advertising campaign?
[Answer: One cannot say that the advertising campaign was
unsuccessful.]

18. A study has shown that in a certain profession the women are 
receiving only 88% as much on the average, as their male 
counterparts receive. However, the study is several years old 
and a women's organization wants to determine if the women in this 
profession are losing relative to the men. It is known that the 
men in the profession now receive 138% of the pay they received 
when the above study was conducted. A random sample of female 
professionals is made. The average pay of these women was found 
to be the following (as a percentage of the current average pay
of men in the profession): 70%, 78%, 80%, 83%, 84%, 86%, 87%, 96%.
Compute the sample mean and sample standard deviation. At the
.05 significance level are the women in this profession losing 
ground relative to the men? Does this coincide with your "gut 
feeling" in this problem? [Answer: No; no] 

19. The Food and Drug Administration (FDA) suspects that a drug
company is producing a certain pill with a purity less than that
required by law. The law allows at most 5 parts per million (ppm)
of a certain impurity. An FDA laboratory tests a random sample
of 50 pills. They find that the pills have a sample mean impurity
of 5.4 ppm and a sample variance of 4.38 (ppm)^2. Can they
assert that the pills do not comply with the law? We will return
to this question in a later exercise. [Answer: If the law only
requires that the average amount of impurity is 5 ppm then we cannot
reject the possibility that the manufacturer is complying with the
law.]

20. A paper company was a major polluter of a small river for 
many years. When antipollution laws were enacted it reacted 
slowly at first but later made a major effort to control its 
pollution. Unfortunately the firm suffered from its earlier re- 
calcitrance by acquiring a public image as a major polluter. Indeed 
a very large sample revealed that close to 90% considered the firm 
to be a major polluter. To counter this they began a public 
relations campaign. After the campaign a random sample of 200 
individuals were asked whether the company was still a major 
polluter. It was found that 174 felt this way. Did the campaign 

have a significant effect? [Answer: No] 

21. (Hans Zeisel). Dr. Benjamin Spock, author of a famous book
on baby care, and others were initially convicted of conspiracy 
in connection with the draft during the Vietnam war. The defense 
appealed, one ground being the sex composition of the jury panel. 
The jury itself had no women, but chance and challenges could 
make that happen. Although the defense might have claimed that 
the jury lists (from which the jurors are chosen) should contain 
55% women, as in the general population, they did not. Instead 
they complained that six judges in the court averaged 29% women 
in their jury lists, but the seventh judge, before whom Spock was 
tried, had fewer, not just on this occasion but systematically. 
The last 9 jury lists for that judge contained the following
counts:

                 Women    Men    Total    Proportion women

                    8      42      50          0.16
                    9      41      50          0.18
                    7      43      50          0.14
                    3      50      53          0.06
                    9      41      50          0.18
                   19     110     129          0.15
                   11      59      70          0.16
                    9      91     100          0.09
                   11      34      45          0.24

Grand totals       86     511     597          0.144



Did the jury lists for this judge have a significantly smaller 
percentage of women? Because of the seriousness of the case, use 
an extremely small significance level: 0.0001. 

22. It has been said that Chevalier de Mere actually observed 
the subtle distinction in probability between obtaining at least 
one six in four throws of a die and obtaining at least one double- 
six in twenty-four throws of a pair of dice. See exercise Il.^g 
Could he have done so? Since he did not have access to the
elaborate machinery of the normal distribution and significance
tests, it is difficult to imagine what he might have deduced
about any observations he might have made.

However, we could ask what is the probability that he would
not have observed a difference between the two experiments in a certain
number of trials. Let $X_n$ be the number of times out of n
trials that at least one six is obtained in four throws of a die.
Let $Y_n$ be the corresponding random variable for the double-six
trial. What is the probability that $X_n - Y_n$ is positive?
Since de Mere's calculation showed that $Y_n$ should have been the
more probable of the two, such an observation would have shocked
him. For definiteness compute this probability for n = 10, 100,
200, 500, 1000, 2000 and 5000. How many times would de Mere have
had to have tried both possibilities in order to reject (at the 5%
level) this explanation of his perplexity? How many throws of one
or two dice does this involve? What can one conclude?
[Answer: 3900 and 109,200. Conclusion: Either de Mere tried
this experiment a great number of times or else we cannot dismiss
the possibility that he did not in fact succeed in detecting the
difference between the two probabilities.]

23. A political scientist wishes to determine if there is a 
significant difference between the preferences of voters in two 
similar neighborhoods of a city with respect to an upcoming race 
for mayor. Samples of 30 voters are taken from each of the neigh- 
borhoods. In one sample 12 voters prefer the incumbent while in 
the other neighborhood 19 do so. Using a 10% significance level, 
decide whether there is a significant difference between the
neighborhoods. [Answer: Yes] 

24. A medical researcher samples 100 records of adults having 
diagnosed coronary heart disease from one city, taking care to 
ensure that the sample is random. The average cholesterol value 
for these individuals was found to be 296, and the sample standard 
deviation was 30. The researcher then took a random sample of 

200 people from the same city who never had diagnosed heart 
disease. The mean cholesterol value for this sample was 310, and 
the sample standard deviation was 50. Do individuals without 
diagnosed heart disease have a significantly larger cholesterol 
value than those with diagnosed heart disease? [Answer: Yes] 

25. A small college soccer team won its conference championship 
9 times in the first 20 years of its existence. Then for the 
next twenty years it won only 3 times. Is this significant? 

Is it very significant? [Answer: Yes; no]. 

26. When the president of the company in exercise 20 discovered 
that the questionnaire used in the post-campaign sample included 
the word "still," she was incensed: the question seemed biased 
in favor of a yes answer. Accordingly, she immediately proceeded 
to write her own questionnaire and take a new random sample. 

The only change was the omission of the word "still." In this 
new sample of 200 individuals only 160 felt that the company was 
a major polluter. Did the alteration of the questionnaire have 
a significant effect? Did the public relations campaign have a 
significant effect? [Answers: Yes; yes] 



Confidence Intervals 

27. Using the information in exercise 24, give a 90% confidence 
interval for the following: 

(a) the individual cholesterol values of individuals without
heart disease;

(b) the individual cholesterol values of individuals with heart
disease;

(c) the mean cholesterol value of all individuals without heart
disease;

(d) the mean cholesterol value of all individuals with heart
disease.

[Answers: 310 ± 82.25; 296 ± 49.35; 310 ± 6; 296 ± 5]

28. If a set of grades on a statistics examination are approxi- 
mately normally distributed with a mean of 82 and a standard 
deviation of 6.9, find: 

(a) The lowest passing grade if the lowest 10% of the students
are given F's.

(b) The highest B if the top 5% of the students are given A's.

29. The average life of a certain type of engine is 10 years, 
with a standard deviation of 3.5 years. The manufacturer replaces 
free all engines that fail while under guarantee. If he is
willing to replace only 2% of the engines that fail, how long a
guarantee should he offer? Assume a normal distribution.

30. The braking distances of two cars, F and C, from
50 k.p.h. are normally distributed, one with mean 30m and 
standard deviation 8m, the other with mean 35m and standard de- 
viation 5m. If they both approach each other on a narrow mountain 
road and first see each other when they are 100m apart, what is 
the probability that they avoid a collision? [Answer: .9999] 

31. If the probability of a male birth is 0.512, what is the 
probability that there will be fewer boys than girls in 1000 births? 
[Answer: 0.215] 

32. A multiple-choice quiz has 100 questions each with four 
possible answers of which only one is the correct answer. What 
is the probability that sheer guesswork yields from 10 to 30 
correct answers for 40 of the 100 problems about which the student 
has no knowledge? 

33. A firm wishes to estimate (with a maximum error of 0.05
and a 98% confidence level) the proportion of consumers
who use its product. How large a sample will be required in order 
to make such an estimate if the preliminary sales reports indicate 
that about 25 percent of all consumers use the firm's product? 
How large a sample would be needed if no preliminary information 
were available? 

34. A sponsor of a weekly television program is interested in 
estimating the proportion of the city population who regularly 
watch its program. The sponsor wishes the estimate to be made 
with a 90% confidence level and an error of at most 4%. The 
sponsor has no information concerning the proportion of viewers who
watch the program. How large a sample will be required to make
the estimate? 



35. A person has just hired a building contractor to build a 
house. The house will be built in three stages. First, the 
contractor lays the foundation; second, the frame and exterior 
are built; and last a subcontractor puts in the wiring, plumbing, 
and interior. Each stage must be completed before the next is 
started. In attempting to get an estimate of when the house 
will be totally completed, the purchaser is able to get the 
following information from those in charge of each stage. 

Stage    Expected Time of Completion of Stage    Standard Deviation
                      (in Weeks)                     (in Weeks)

  I                       3                              1
  II                      8                              2
  III                     5                              2

What is the expected value and standard deviation of completion
time for the house, assuming the completion times of the stages
are independent?
[Answer: 16 weeks and 3 weeks]

36. The College Entrance Examination Board verbal and mathematical 
aptitude scores are approximately normally distributed with mean 
500 and standard deviation 100 except that scores above 800 and
below 200 are arbitrarily reported as 800 and 200 respectively.
What percentage of the students taking the verbal exam score above
800 or below 200? [Answer: about 0.3%]

37. You are the head of a polling company, and you have a contract 
to determine the percentage of the electorate in favor of a can- 
didate. There are 1,000,000 members of the electorate, and each 
member chooses his/her opinion independently. Your contract 
specifies that you must determine the percentage to within 1% with 
5% confidence or to within 5% with 1% confidence. Which is cheaper? 

38. A professor at a small college walks to school each day. On 
the average the trip takes 15 minutes with a standard deviation of 

3 minutes. Assume a normal distribution. If the professor's first 
class is at 10:30 AM, when must the professor leave home in order 
to be 95% certain of arriving on time? If the college serves 
coffee from 10:00 AM to 10:30 AM, how often would the professor
have coffee before class if the professor left home at 10:10 AM 
every day? [Answers: 10:10 AM; 0.952]. 

39. Suppose that resistors can be purchased each with a resistance
that is uniformly distributed between 900 Ω and 1100 Ω. If 10 such
resistors are connected in parallel, what is the probability that
their total resistance will be within 5% of 10,000 Ω?

40. When a thumbtack is tossed, it falls on its flat head with 
probability p . What must you do to find p to within p/10 
at significance level 0.05? at significance level 0.01? 

41. You own a telephone company that services two cities A 
and B, each having 5000 customers. You would like to link your 
exchange with the more distant city, C . You estimate that 
during the busiest time each customer will require a line to C
with probability .01. You want to be sure that there are enough 
lines to C so that there is only a 1% chance that at the busiest 
time some customer will be unable to get a line to C. Each 
trunkline to C will cost $10,000. You have two options. Either 
link A and B as if they were separate exchanges or link the 
entire exchange to C . In the second option, additional equip- 
ment costing $50,000 would be needed. Which option is cheaper? 

42. A clinical trial is conducted to determine if a certain type 
of drug has an effect on the incidence of a certain disease. A 
sample of 100 rats was kept in a controlled environment and 50 

of the rats were given the drug. Of the group not given the drug 
(the control group) , there were 12 incidences of the disease, 
while 9 of the other group contracted it. Compute a 90% confidence 
interval for the difference in probability of contracting the 
disease between a rat given the drug and a control rat. [Answer: 
0.06 ± 0.134] 

Hypothesis Testing 

43. A statistician named Burr relates the following story. 

"Having bought a bag of roasted chestnuts, the author 
walked home in the dark eating them with much gusto. 
After eating about 20, he arrived home, and, in open- 
ing the remaining 10 under the light, he found that 7 
contained worms. What is the probability that none
of the 20 contained worms? Or to phrase the problem
better for statistical analysis: If there were
only 7 wormy chestnuts among the original 30, what 
is the probability of drawing the first 20 all free 
from worms?" 

44. Since there is no reason to believe that the salaries of 
individuals will be normally distributed, only in very large 
samples can we expect the mean to be normally distributed. With 
this in mind reexamine exercise 18. Regardless of the distribution 
of salaries of the female professionals, half the salaries will be 
above the median salary. Suppose that the salaries of the male 
professionals are known to have a median that is 97% of the mean. 
This situation would be typical since the presence of a few very
high salaries can cause the mean to be somewhat unrepresentative.
Does it now appear that the female professionals are gaining or
losing relative to the men? [Answer: the median salary of the
women is 83.5%, while 88% of 97% is 85.4%. So it appears that
they are losing ground, but the result is again not significant.
Using a Bernoulli process to model this, if the median salary
of all women were 85.4%, then each woman's salary would have
probability 0.5 of exceeding this figure. We observe that 3 of
the 8 salaries do so. The probability that 3 (or fewer) would do
so by chance alone is
$(0.5)^8 + 8(0.5)^8 + \binom{8}{2}(0.5)^8 + \binom{8}{3}(0.5)^8 = 0.363$,
which is too large for us to reject this model.]

45. (Certificate) In a certain survey of the work of chemical 
research workers, it was found, on the basis of extensive data, 
that on average each one required no fume hood for 60 per cent
of the time, one for 30 per cent and two for 10 per cent; three
or more were never required. If a group of four chemists worked 
independently of one another, how many fume hoods should be 
available in order to provide adequate facilities for at least 95 
per cent of the time? 

Compute the probability distribution of the number of fume 
hoods needed by the four chemists. Then use this to answer 

the question. 

46. In exercise 45, how many fume hoods would be required to 
satisfy a group of 50 chemists at least 95 per cent of the time? 
Use a normal approximation. 

47. A governmental agency is responsible for protecting the fish 
populations of the lakes in a certain region. By means of many
observations in the past it sets lower bounds for populations, in 
each lake, of various species of fish. If it is later found that 
one species in a given lake has gone below the specified lower 
bound, the agency has the power to enforce limits on the pollutants 
which the factories bordering on the lake may discharge into the 
lake. How can the agency determine whether the lower limit has 
been reached? One way to do so is to employ the procedure described
in exercise II.J23. Let the lower limit be L. We first
capture n fish, tag them and return them to the lake. Some
time later, we drop a net, capture m fish and count how many
are tagged. Describe how to find a number T so that there is
only a 5% chance that T or more tagged fish will be found when
there are L or more fish in the lake. We will return to this
problem in exercise VI. XX. 

48. (Silvey) An investigation was carried out on two suggested 
antidotes to the consequences of drinking, these being (a) 2 lb 

of mashed potatoes and (b) a pint of milk. Ten volunteers were 
used, five to each antidote, the allocation to antidote being 
random. One hour after each had drunk the same quantity of alcohol 
and swallowed the appropriate antidote, a blood test was carried 
out and the following levels (mg/ml) of alcohol in the blood 
were recorded: 

(a) 76 52 92 80 70 

(b) 110 96 74 105 125 

By means of the runs test, decide whether there is sufficient 
evidence to conclude that one treatment is more effective than 
the other. 

49. (Guenther) Suppose that it is hypothesized that twice as 
many automobile accidents resulting in deaths occur on Saturday 
and Sunday as on other days of the week. That is, the probability 
that such accidents occur on Saturday is 2/9, on Sunday is 2/9, 
and on each other day of the week is 1/9. From the national record
file, cards for 90 accidents are selected at random. These yield
the following distribution of accidents according to the days of
the week:

Sun.   Mon.   Tues.   Wed.   Thurs.   Fri.   Sat.
 30      6      8      11       7      10     18

Do these data tend to support or contradict the hypothesis? 
Use a 5% significance level. 



50. We wish to test whether or not the successive outcomes of a 
roulette wheel are random. For simplicity we will only record 
whether the ball fell into a red or a black slot of the wheel. 
In twenty spins of the wheel we observe the sequence: 
RRBRRBBBBRBRRRRRBBBR. Applying the runs test and using a 5% 
significance level, are the successive outcomes random? What does 
this suggest about this roulette wheel? 

51. (Pazer & Swanson) A political scientist wishes to determine 
if the political preference of homeowners is independent of their 
immediately adjacent neighbors. A sequence of sixteen homeowners, 
along the same side of a street, were interviewed, and based 

upon their responses were designated as either more conservative 
than their median C, or less conservative than their median L . 
Here is the resulting sequence: 

L, C, L, C, C, C, C, L, L, C, L, L, L, L, C, C. 
Using the runs test and a 5% level of significance, determine
whether there is any evidence that political opinions are independent
of one's neighbors (at least for this particular street).

52*. We all have taken laboratory courses at some time or other,
and the temptation to "fudge" data on our report has certainly 
occurred to us. What we may not have realized is that one can 
devise a statistical test to determine whether or not such fudging 
took place. Suppose that a biologist wishes to prove that a 
certain genetic trait follows the classical Mendelian laws. In
this theory a trait is determined by two genes, one acquired from
each of the two parents. Let us say that there are two alleles
(possibilities) for a given gene, one dominant A and one recessive
a. Then there are three different genotypes: AA, Aa and aa.
Let us suppose, as it often happens, that AA and Aa are indistin- 
guishable. By successive inbreeding the biologist has access to 
two individuals known to have genotypes AA and aa, respectively. 
When these are crossed the offspring all have genotype Aa . But 
when two of the offspring are crossed we find that the three
genotypes AA, Aa and aa appear among their offspring with probabilities
1/4, 1/2 and 1/4 respectively. Of course since we cannot
actually distinguish AA from Aa, this means that on the average
3/4 of the offspring exhibit the dominant trait and 1/4 exhibit 
the recessive one. Let us suppose that the biologist produces 
10,326 offspring from a pair of Aa parents. He observes that 
7746 have the dominant trait and 2580 the recessive one. These 
are very close to the expected numbers 7744.5 and 2581.5 so he
concludes that the experiment tends to support the hypothesis that 
this trait obeys the Mendelian laws. 

Compute the probability that such an experiment would actually 
result in as close a fit with the theory as the biologist actually 
found. At the 5% level can one reject the hypothesis that the 
experiment was properly carried out? [Answers: about 3.6%, yes]. 

The Law of Large Numbers

53. How many times must one toss a fair coin in order to have 
95% confidence that it really is fair? Compare the number obtained 
by using the Bienayme-Chebychev inequality with what we get using 
the Central Limit Theorem.

54. Let X be a standard random variable (i.e. $E(X) = 0$ and
$E(X^2) = 1$). Using the Bienayme-Chebychev inequality, compute
the smallest a so that:

(a) $P(-a < X < a) \ge .95$

(b) $P(-a \le X \le a) \ge .99$

(c) $P(X > a) \le .05$

(d) $P(X > a) \le .01$

Compare these values with the corresponding ones for the case of X
being N(0,1).

55*. Let X be a nonnegative random variable. Prove that
$P(X > a) \le E(X)/a$, for any $a > 0$, whether X has a variance
or not. This is known as Markov's inequality. Show that Markov's
inequality implies the Bienayme-Chebychev inequality.

56*. Prove the Law of Large Numbers for probability distributions
having finite variance, using the Central Limit Theorem. The
Law of Large Numbers is in fact true for all probability distributions
possessing a mean, but this is much more difficult to
prove.

57. Show that $\sum_{n=1}^{\infty} X_n/n$ converges with probability 1, where
$X_n$ is equally likely to be either +1 or -1.

58. Explore experimentally what it means for a random variable not
to have an expectation. Write computer programs to simulate
the St. Petersburg game (Exercise III.W) and the gangster distribution
(exercise III.36). In each case print out two columns
of numbers. The first column shows the number of times the random
experiment has been repeated, and the second shows the sample
average of all the trials made so far. For the gangster distribution
one should print a third column showing the median of
all the trials made so far. This last number is the hardest to
compute, and unlike the sample mean it requires that all the
previous trials be stored in an array. Give an intuitive interpretation
for what it means for a random variable not to have a
finite expectation.



Chapter V Conditional Probability

The theory of probability consists largely in 
making precise the probabilistic language that already forms 
part of our language. In effect the purpose of this course 
is to learn to "speak probability" properly. The lowest 
level of our probabilistic language is the event. This cor- 
responds to simple phrases that are either true or false. 

For example in the Bernoulli process $H_i$ is the event "the
$i$th toss is heads". Random variables represent the next
level: simple quantitative questions. For example one
might ask: "how long must one toss a coin until the first
head appears?" If we use the convention 0 = false and 1 =
true, every event may also be regarded as a random variable
by using the indicator.

Conditional probability allows probabilistic reasoning. 
That is, we may now ask compound questions. For example, 
"if the first toss of a coin is tails, how long must one wait
until the first head appears?" Moreover, we can split apart
and combine such questions into new questions. The precise
meaning of such expressions is not always obvious and
is the source of many seeming paradoxes and fallacies. As
a simple example, the question, "if the first toss is heads,
is the second toss heads?" is very different from the
question, "are the first two tosses heads?" The probability
of the first is p while that of the second is $p^2$.

1. Discrete Conditional Probability 

We begin with the definition and properties of 
the conditional probability of events. 

Definition. Let A and S be events such that P(S) > 0. The
conditional probability of A given S is

$$P(A|S) = \frac{P(A \cap S)}{P(S)}.$$

The event S is called the condition.
The conditional probability P(A|S)
answers the question: "if S has occurred,
how probable is A?" In effect we have altered our
sample space. Since we know that S has occurred, the sample
space is now S. The event A given that S has occurred must
now be interpreted as $A \cap S$, and the probability is $P(A \cap S)$
normalized by the probability of S so that the total probability
is 1. Ordinary probabilities are the special case of
conditional probabilities where the condition is the sample
space $\Omega$: $P(A) = P(A|\Omega)$.
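
The definition can be tried out on a small finite sample space. The following is a minimal sketch in Python, assuming two fair dice with equally likely outcomes; the names `omega`, `prob` and `cond_prob` and the particular events are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice, all equally likely (an assumption).
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, s):
    """P(A|S) = P(A and S) / P(S); the condition S must have positive probability."""
    if prob(s) == 0:
        raise ValueError("the condition must have positive probability")
    return prob(a & s) / prob(s)

A = {w for w in omega if w[0] + w[1] == 8}   # the sum of the dice is 8
S = {w for w in omega if w[0] % 2 == 0}      # the first die is even

print(prob(A))              # 5/36
print(cond_prob(A, S))      # 1/6
print(cond_prob(A, omega))  # 5/36, the same as the ordinary probability
```

Conditioning on the whole sample space returns the ordinary probability, as the last line of the definition says it should.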

Law of Alternatives

Suppose that instead of knowing that a certain
event has occurred, we know that one of several possibilities
has occurred, which are mutually exclusive. Call these alternatives
$A_1, A_2, \ldots$. There may possibly be infinitely
many alternatives. More precisely the $A_i$ form a set of alternatives
if

(1) $A_i \cap A_j = \emptyset$ if $i \ne j$ (mutually exclusive)

(2) $\bigcup_i A_i = \Omega$ (exhaustive)

(3) $P(A_i) > 0$ for all i.

Then for any event B:

$$P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots$$

Law of Alternatives

To verify this law we simply expand and cancel. The $A_i$ are
disjoint so the events $B \cap A_i$ are also disjoint.

$$\begin{aligned}
P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots
&= \frac{P(B \cap A_1)}{P(A_1)}P(A_1) + \frac{P(B \cap A_2)}{P(A_2)}P(A_2) + \cdots \\
&= P(B \cap A_1) + P(B \cap A_2) + \cdots \\
&= P((B \cap A_1) \cup (B \cap A_2) \cup \cdots) \\
&= P(B \cap (A_1 \cup A_2 \cup \cdots)) \\
&= P(B \cap \Omega) \\
&= P(B)
\end{aligned}$$

If the alternatives $A_i$ are not exhaustive, we can
still make sense of the law of alternatives provided all
probabilities involved are conditioned by the event $A = \bigcup_i A_i$.
More precisely we shall call a set of events $A_i$ a set of alternatives
for A if

(1) $A_i \cap A_j = \emptyset$ if $i \ne j$ (mutually exclusive)

(2) $\bigcup_i A_i = A$

(3) $P(A_i) > 0$ for all i

Then for any event B:

$$P(B|A) = P(B|A_1)P(A_1|A) + P(B|A_2)P(A_2|A) + \cdots$$

Conditional Law of Alternatives
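
Both forms of the law can be checked numerically on a finite sample space. The sketch below assumes two fair dice and takes the value shown by the first die as the set of alternatives; the events and helper names are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two fair dice (an assumption)

def prob(event):
    """Probability of an event under the uniform measure on omega."""
    return Fraction(len(event), len(omega))

def cond_prob(b, a):
    """P(B|A) = P(B and A) / P(A)."""
    return prob(b & a) / prob(a)

B = {w for w in omega if w[0] + w[1] >= 10}   # the sum is at least 10

# Alternatives A_1, ..., A_6: the value of the first die.
# They are disjoint, exhaustive, and each has positive probability.
alternatives = [{w for w in omega if w[0] == i} for i in range(1, 7)]

# Law of alternatives: P(B) = sum of P(B|A_i) P(A_i).
total = sum(cond_prob(B, a) * prob(a) for a in alternatives)

# Conditional law: A_4, A_5, A_6 are alternatives for A = "first die is at least 4".
A = alternatives[3] | alternatives[4] | alternatives[5]
conditional_total = sum(cond_prob(B, a) * cond_prob(a, A) for a in alternatives[3:])

print(total, prob(B))                     # 1/6 1/6
print(conditional_total, cond_prob(B, A)) # 1/3 1/3
```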

Bayes' Law

One of the features of probability as we have developed it 
so far is that all events are treated alike: in principle no 
events are singled out as "causes" while others become "effects."
Bayes' law, however, is traditionally stated in terms of causes
and effects. Although we will do so also, one should be careful 
not to ascribe metaphysical significance to these terms. 
Historically, this law has been misapplied in a great number of 
cases precisely because of such a misunderstanding. 

We are concerned with the following situation. Suppose we
have a set of alternatives $A_1, A_2, \ldots$ which we will refer to as
"causes". Suppose we also have an event B which we will call the
"effect". The idea is that we can observe whether the effect B
has or has not occurred but not which of the causes $A_1, A_2, \ldots$
has occurred. The question is to determine the probability that
a given cause occurred given that we have observed the effect.
We assume that we know the probability for each of the causes
to occur, $P(A_i)$, as well as the conditional probability for B
to occur given each cause, $P(B|A_i)$. The probability $P(A_i)$ is
called the a priori probability of $A_i$, and we seek the probability
$P(A_i|B)$ which we call the a posteriori probability of $A_i$. If the
alternatives represent various experimental hypotheses, and B
is the result of some experiment, then Bayes' law allows us to
compute how the observation of B changes the probabilities of
these hypotheses.



$$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}$$

Bayes' Law

Proof of Bayes' Law

Let A and B be any two events having positive probability.
By the definition of conditional probability,

$$P(B|A) = \frac{P(A \cap B)}{P(A)} \quad\text{and}\quad P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

As a result we have two ways to express $P(A \cap B)$:

$$P(B|A)P(A) = P(A \cap B) = P(A|B)P(B).$$

Solving for P(A|B) gives:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$$

Now apply this fact to the case for which A is $A_i$ and use the
law of alternatives to compute the denominator. The resulting
expression is Bayes' law.
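
As an illustration of how the a priori probabilities are turned into a posteriori ones, here is a minimal Python sketch. The function `posterior` and the coin example are hypothetical and are not taken from the text: a coin is assumed to be either fair or two-headed with equal a priori probability, and the observed effect B is "three tosses all came up heads".

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' law: the list of P(A_i|B), given the P(A_i) and the P(B|A_i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]   # P(B|A_i) P(A_i)
    evidence = sum(joint)                                  # P(B), by the law of alternatives
    if evidence == 0:
        raise ValueError("the effect B must have positive probability")
    return [j / evidence for j in joint]

# Hypothetical causes: A_1 = fair coin, A_2 = two-headed coin.
priors = [Fraction(1, 2), Fraction(1, 2)]
# P(B|A_i) for B = "three heads in three tosses".
likelihoods = [Fraction(1, 8), Fraction(1)]

post = posterior(priors, likelihoods)
print(post)   # [Fraction(1, 9), Fraction(8, 9)]
```

The denominator is computed exactly as in the proof: it is the law of alternatives applied to B.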

Law of Successive Conditioning 

Suppose we have n events $B_1, B_2, \ldots, B_n$ such that
$P(B_2 \cap B_3 \cap \cdots \cap B_n) > 0$. Then we can compute
$P(B_1 \cap B_2 \cap \cdots \cap B_n)$ using a sequence of conditional
probabilities.

$$P(B_1 \cap B_2 \cap \cdots \cap B_n) =
P(B_1|B_2 \cap \cdots \cap B_n)\,P(B_2|B_3 \cap \cdots \cap B_n) \cdots P(B_{n-1}|B_n)\,P(B_n)$$

Law of Successive Conditioning



To prove this we just expand and cancel. This law corresponds
to the intuitive idea that the probability of
several events occurring is the product of their individual
probabilities. This idea is correct provided we interpret
"individual probability" to mean the appropriate conditional
probability.
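
A familiar instance of successive conditioning is drawing cards without replacement. The sketch below is an assumed example, not one from the text: it computes the probability that the first few cards dealt from a standard 52-card deck are all aces as a product of conditional probabilities, and compares the result with a direct count. The function name `all_aces` is illustrative.

```python
from fractions import Fraction
from math import comb

def all_aces(draws, aces=4, deck=52):
    """P(the first `draws` cards are all aces), as a product of conditional probabilities."""
    p = Fraction(1)
    for already_drawn in range(draws):
        # Given that every earlier card was an ace, `aces - already_drawn`
        # aces remain among `deck - already_drawn` cards.
        p *= Fraction(aces - already_drawn, deck - already_drawn)
    return p

print(all_aces(3))                         # 1/5525, i.e. (4/52)(3/51)(2/50)
print(Fraction(comb(4, 3), comb(52, 3)))   # 1/5525, by counting 3-card hands
```

Each factor is an "individual probability" only in the conditional sense: 3/51 is the probability of a second ace given that the first card was an ace.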

By using the law of alternatives and the law of suc- 
cessive conditioning we split the computation of an ordinary 
or a conditional probability into a succession of conditional 
probabilities. In effect we form compound, nested, condi- 
tional questions out of simple questions. 

Independence 

Suppose that A and B are two events. If either A
or B has probability zero of occurring, then A and B are
trivially independent events. If P(A)>0 and P(B)>0, then
the concept of the independence of the events A and B is
best stated by using conditional probability. Namely, the
following statements are equivalent:

(1) A and B are independent

(2) P(A|B) = P(A)

(3) P(B|A) = P(B)

Using this terminology we can see much more clearly that the
independence of two events A and B means that knowing one
has occurred does not alter the probability that the other
will occur, or equivalently that the measurement of one does
not affect the measurement of the other.
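
The equivalence of the three statements can be observed on a small example. The following sketch again assumes two fair dice; the events and the helper `independent` (which tests the product rule P(A and B) = P(A)P(B)) are illustrative choices rather than anything defined in the text.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two fair dice (an assumption)

def prob(event):
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    return prob(a & b) / prob(b)

def independent(a, b):
    """Product-rule test for independence: P(A and B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {w for w in omega if w[0] % 2 == 0}       # the first die is even
B = {w for w in omega if w[0] + w[1] == 7}    # the sum is 7
C = {w for w in omega if w[0] + w[1] == 8}    # the sum is 8

print(independent(A, B), cond_prob(B, A) == prob(B), cond_prob(A, B) == prob(A))
print(independent(A, C), cond_prob(C, A), prob(C))
```

In this example knowing that the first die is even leaves the chance of a sum of 7 unchanged, but it does change the chance of a sum of 8.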

2. Gaps and Runs in the Bernoulli Process

Recall that $T_i$ is the gap between the $(i-1)$st and
the $i$th success. We claimed that the $T_i$ are independent
random variables, using an intuitive probabilistic argument.
We now have the terminology for making this argument rigorous.
The key notion is the law of alternatives.

Consider the conditional probability of the gap $T_{i+1}$
being n given that all of the preceding gaps are known:

$$P(T_{i+1}=n \,|\, (T_1=k_1) \cap (T_2=k_2) \cap \cdots \cap (T_i=k_i)).$$

Computing this probability is quite easy for it corresponds
to exactly two patterns of H's and T's (up to some toss).
Write $k = k_1+k_2+\cdots+k_i$. The first pattern is

$$\underbrace{TT\cdots TH}_{k_1}\ \underbrace{TT\cdots TH}_{k_2}\ \cdots\ \underbrace{TT\cdots TH}_{k_i}$$

$$P((T_1=k_1) \cap (T_2=k_2) \cap \cdots \cap (T_i=k_i)) = q^{k-i}p^i$$

and the second is

$$\underbrace{TT\cdots TH}_{k_1}\ \underbrace{TT\cdots TH}_{k_2}\ \cdots\ \underbrace{TT\cdots TH}_{k_i}\ \underbrace{TT\cdots TH}_{n}$$

$$P((T_1=k_1) \cap (T_2=k_2) \cap \cdots \cap (T_i=k_i) \cap (T_{i+1}=n)) = q^{k+n-i-1}p^{i+1}$$

Therefore

$$P(T_{i+1}=n \,|\, (T_1=k_1) \cap (T_2=k_2) \cap \cdots \cap (T_i=k_i))
= \frac{q^{k+n-i-1}p^{i+1}}{q^{k-i}p^i} = q^{n-1}p.$$


Although the above computation is not very difficult, there is
an easier way to see it. Think of the condition

    (T_1 = k_1) \cap \dots \cap (T_i = k_i)

as changing our sample space. The new sample space consists of
all infinite sequences of H's and T's, but renumbered starting
with k+1 = k_1 + k_2 + \dots + k_i + 1. This new sample space is
identical to the Bernoulli sample space except for the
renumbering and the fact that T_{i+1} is now the waiting time
for the first success. Therefore

    P(T_{i+1} = n | (T_1 = k_1) \cap (T_2 = k_2) \cap \dots \cap (T_i = k_i)) = q^{n-1} p.
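
As a quick check of the computation above, the following Python
sketch builds the two toss patterns explicitly and divides their
probabilities. The function names and the use of exact fractions
are illustrative assumptions, not part of the text; heads are
taken to be the successes, with probability p.

```python
from fractions import Fraction

def gaps_to_tosses(gaps):
    """Toss pattern for the given gaps: each gap k is k-1 tails, then a head."""
    return "".join("T" * (k - 1) + "H" for k in gaps)

def pattern_probability(tosses, p):
    """Probability that the Bernoulli process begins with this exact pattern."""
    q = 1 - p
    prob = Fraction(1)
    for toss in tosses:
        prob *= p if toss == "H" else q
    return prob

def conditional_gap_probability(n, earlier_gaps, p):
    """P(T_{i+1} = n | T_1 = k_1, ..., T_i = k_i) as a ratio of pattern probabilities."""
    earlier = list(earlier_gaps)
    numerator = pattern_probability(gaps_to_tosses(earlier + [n]), p)
    denominator = pattern_probability(gaps_to_tosses(earlier), p)
    return numerator / denominator

p = Fraction(1, 3)
q = 1 - p
# The earlier gaps (2, 5, 1) drop out of the ratio, leaving q^(n-1) p.
print(conditional_gap_probability(3, [2, 5, 1], p) == q ** 2 * p)
```

Whatever earlier gaps are supplied, the ratio depends only on n,
which is the content of the argument above.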

The key to the effective use of conditional probability is
that it changes the sample space and hence the interpretation
of the random variables defined on the old sample space.

We now apply the law of alternatives. The events

    (T_1 = k_1) \cap (T_2 = k_2) \cap \dots \cap (T_i = k_i),

as the k_j's take on all positive integer values, form a set of
alternatives; for the set of sample points belonging to none of
them is an event whose probability is zero. Hence

    P(T_{i+1} = n)
      = \sum_{k_1, \dots, k_i} P(T_{i+1} = n | (T_1 = k_1) \cap \dots \cap (T_i = k_i)) P((T_1 = k_1) \cap \dots \cap (T_i = k_i))
      = \sum_{k_1, \dots, k_i} q^{n-1} p \, P((T_1 = k_1) \cap \dots \cap (T_i = k_i))
      = q^{n-1} p \sum_{k_1, \dots, k_i} P((T_1 = k_1) \cap \dots \cap (T_i = k_i))
      = q^{n-1} p.

Notice that the only fact we used about the events
(T_1 = k_1) \cap \dots \cap (T_i = k_i) was that they form a set
of alternatives. An immediate consequence is that the gaps T_i
are all equidistributed. Furthermore, if we use the definition
of conditional probability we have that

    P(T_{i+1} = n) = P(T_{i+1} = n | (T_1 = k_1) \cap \dots \cap (T_i = k_i))
                   = \frac{P((T_1 = k_1) \cap \dots \cap (T_i = k_i) \cap (T_{i+1} = n))}{P((T_1 = k_1) \cap \dots \cap (T_i = k_i))}

or

    P((T_1 = k_1) \cap \dots \cap (T_i = k_i) \cap (T_{i+1} = n))
        = P((T_1 = k_1) \cap \dots \cap (T_i = k_i)) P(T_{i+1} = n).

By mathematical induction we have that

    P((T_1 = k_1) \cap \dots \cap (T_i = k_i) \cap (T_{i+1} = n))
        = P(T_1 = k_1) \cdots P(T_i = k_i) P(T_{i+1} = n),

i.e. that the T_i are independent.

We can now see more clearly how the T_i are related. They have
the same distribution, but they are not the same. They have
this property because the measurement of the ith gap "really"
occurs in a different sample space than the first gap, but this
new sample space is identical to the ordinary Bernoulli sample
space except for how we number the tosses.

We went into detail for this argument to illustrate a
nontrivial use of the law of alternatives. We will be more
abbreviated in the future.

As a further illustration of the law of alternatives, we
consider a problem mentioned in chapter II. Namely, what is
the probability that a run of h heads occurs before a run of t
tails? Let A be this event. We solve this problem by using the
following fact: when a run of fewer than h heads is "broken" by
getting a tail, we must "start over", and similarly for runs of
tails.

First we use the law of alternatives: 

    P(A) = P(A | X_1 = 1) P(X_1 = 1) + P(A | X_1 = 0) P(X_1 = 0).

Write u = P(A | X_1 = 1) and v = P(A | X_1 = 0) so that

    P(A) = up + vq.



Next we use the conditional law of alternatives for each of
P(A | X_1 = 1) and P(A | X_1 = 0). Consider the first one.

We know that we got a head on the first toss so the run has
started. We then "wait" to see if the run will be broken. That
is, let B_t be the waiting time for the first tail starting
with the second toss. Either we get a tail and break the run or
we get enough heads so that A occurs. More precisely,

    P(A | X_1 = 1) = P(A | (X_1 = 1) \cap (B_t < h)) P(B_t < h)
                   + P(A | (X_1 = 1) \cap (B_t \ge h)) P(B_t \ge h).

For the first alternative the run of heads has been broken by
a tail. Hence P(A | (X_1 = 1) \cap (B_t < h)) = P(A | X_1 = 0) = v,
for the earlier heads have no effect on subsequent tosses. All
that matters is that we "started" with a tail. On the other
hand, P(A | (X_1 = 1) \cap (B_t \ge h)) = 1 because the condition
implies that A has in fact occurred. Therefore,

    u = P(A | X_1 = 1) = v P(B_t < h) + 1 \cdot P(B_t \ge h)
      = v (1 - p^{h-1}) + 1 \cdot p^{h-1}.

Remember that for B_t we start counting on the second toss, so
that P(B_t < h) is really the conditional probability
P(B_t < h | X_1 = 1).
The computation for P(A | X_1 = 0) is analogous to that above.
Let B_h be the waiting time for the first head starting with
the second toss. Then

    P(A | X_1 = 0) = P(A | (X_1 = 0) \cap (B_h < t)) P(B_h < t)
                   + P(A | (X_1 = 0) \cap (B_h \ge t)) P(B_h \ge t).

Here P(A | (X_1 = 0) \cap (B_h \ge t)) = 0 because the condition
implies that A has not occurred. The probability
P(A | (X_1 = 0) \cap (B_h < t)) is u because the run of tails has
been broken. Therefore,

    v = P(A | X_1 = 0) = u [1 - q^{t-1}] + 0 \cdot q^{t-1}.

Combining the two equations above gives us the system of
equations:

    u = v (1 - p^{h-1}) + p^{h-1}
    v = u (1 - q^{t-1}).

Solve for u and v and substitute:

    P(A) = up + vq = \frac{p^{h-1} (1 - q^t)}{p^{h-1} + q^{t-1} - p^{h-1} q^{t-1}}.

We check this by considering the special case h=t and
p=q=1/2. As we expect by symmetry, P(A) = 1/2.
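
The two-equation system above is small enough to solve by
substitution in code. The Python sketch below is illustrative
only (the function names are assumptions, and 0 < p < 1 is
assumed); it solves for u and v, forms up + vq, and compares the
result with the closed form.

```python
from fractions import Fraction

def heads_run_first(h, t, p):
    """P(a run of h heads occurs before a run of t tails); heads have probability p."""
    q = 1 - p
    ph = p ** (h - 1)   # P(B_t >= h): the next h-1 tosses are all heads
    qt = q ** (t - 1)   # P(B_h >= t): the next t-1 tosses are all tails
    # Substitute v = u(1 - qt) into u = v(1 - ph) + ph and solve for u.
    u = ph / (1 - (1 - ph) * (1 - qt))
    v = u * (1 - qt)
    return u * p + v * q

def heads_run_first_closed_form(h, t, p):
    """The closed form p^(h-1)(1 - q^t) / (p^(h-1) + q^(t-1) - p^(h-1) q^(t-1))."""
    q = 1 - p
    ph, qt = p ** (h - 1), q ** (t - 1)
    return ph * (1 - q ** t) / (ph + qt - ph * qt)

half = Fraction(1, 2)
print(heads_run_first(3, 3, half))  # the symmetric case h = t, p = q = 1/2
third = Fraction(1, 3)
print(heads_run_first(2, 5, third) == heads_run_first_closed_form(2, 5, third))
```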

3. Sequential Sampling

In most sampling situations, for example sampling 
people in a population, we generally sample the population 
without replacement, i.e. the same individual cannot be 
chosen more than once in one sample. For such a sampling 
procedure, the successive choices are not independent of 
one another; for with each choice, the population (and hence 
the sample space) gets smaller. 

For very large populations this would seem to be a small
effect. But on smaller populations it can be pronounced. For
example, suppose we play a card guessing game. We draw a card
at random from a deck, try to guess the suit, look to see if
our guess was correct and then place the card aside. If we
continue to sample the cards this way, the probabilities for
getting a card of a given suit change constantly. Indeed, we
will always know for certain what the suit of the last card
drawn will be.



The problem of sequential sampling is to describe the
dependence of each of the choices on the other choices. The
idealized model is the following. We have an urn containing
r red balls and b black balls. We select a ball at random
from the urn, note its color and then place it aside. This
procedure is repeated until n balls have been chosen from
the urn. Define the random variables X_i by:

    X_i = 1 if the ith ball chosen is red,
    X_i = 0 if the ith ball chosen is black.

The problem is to find the distributions and the correlations
of the X_i.

Notice that we have switched the roles of the balls and the
boxes. In the occupancy model we place balls into boxes. In
the sampling model the balls become the positions in the
sample and the boxes become individuals in a population. In
the sequential sampling model it is traditional to view the
population more concretely as a collection of colored balls in
an urn.

Consider the first choice X_1. The probability distribution of
X_1 is

    p_0 = P(X_1 = 0) = \frac{b}{r+b}

    p_1 = P(X_1 = 1) = \frac{r}{r+b}



Next consider the second choice X_2. To compute its
distribution we must use the law of alternatives. For example
P(X_2 = 0 | X_1 = 0) is \frac{b-1}{r+b-1} because there is one
fewer black ball in the urn.

    p_0 = P(X_2 = 0) = P(X_2 = 0 | X_1 = 0) P(X_1 = 0) + P(X_2 = 0 | X_1 = 1) P(X_1 = 1)
        = \frac{b-1}{r+b-1} \cdot \frac{b}{r+b} + \frac{b}{r+b-1} \cdot \frac{r}{r+b}
        = \frac{(b-1) b + b r}{(r+b-1)(r+b)}
        = \frac{b (r+b-1)}{(r+b-1)(r+b)}
        = \frac{b}{r+b}.

Similarly,

    p_1 = P(X_2 = 1) = \frac{r}{r+b}.

The random variables X_1 and X_2 are equidistributed! This is
quite unexpected. One wonders whether this is an accident of
algebra or there is some deep principle here. If the latter,
we would expect that all the X_i have the same distribution.
As we shall see this is indeed the case.

The seeming paradox arises from the fact that we are not
considering the random variables conditionally. For example,
in the card guessing game above, if we chose not to look at
the first 51 cards sampled, we would have no reason to suppose
that the last card sampled has any special properties: it
doesn't "know" that the other cards have been sampled.
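
The law-of-alternatives computation for X_2 extends to any
later draw by recursion. The following Python sketch is
illustrative only (the name `prob_black_on_draw` and the use of
exact fractions are assumptions, not from the text); it
conditions on the color of the first ball and recurses on the
smaller urn.

```python
from fractions import Fraction

def prob_black_on_draw(r, b, m):
    """P(X_m = 0): the m-th ball drawn without replacement is black."""
    if not 1 <= m <= r + b:
        raise ValueError("need 1 <= m <= r + b")
    if m == 1:
        return Fraction(b, r + b)
    total = Fraction(0)
    if b > 0:  # alternative: the first ball is black
        total += Fraction(b, r + b) * prob_black_on_draw(r, b - 1, m - 1)
    if r > 0:  # alternative: the first ball is red
        total += Fraction(r, r + b) * prob_black_on_draw(r - 1, b, m - 1)
    return total

# With r = 3 red and b = 2 black balls, every draw has the same distribution.
print([prob_black_on_draw(3, 2, m) for m in range(1, 6)])
```

The recursion is exponential in m, so it is only meant for small
urns; it is a check on the algebra, not an efficient method.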

Exchangeability 

As often happens in mathematics, the situation only 
becomes clear when we consider it from a broader perspective. 
Consider the joint distribution of all the X_i's:

    C_{i_1, ..., i_n} = P((X_1 = i_1) \cap (X_2 = i_2) \cap \dots \cap (X_n = i_n))

where the i_1, ..., i_n take on the values 0 and 1 arbitrarily.
We compute this by using the law of successive conditioning:

    C_{i_1, ..., i_n} = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | (X_1 = i_1) \cap (X_2 = i_2)) \cdots

For example, if n = 6 and (i_1, i_2, ..., i_6) = (0,1,1,0,1,1),
then

    C_{0,1,1,0,1,1} = \frac{b}{r+b} \cdot \frac{r}{r+b-1} \cdot \frac{r-1}{r+b-2} \cdot \frac{b-1}{r+b-3} \cdot \frac{r-2}{r+b-4} \cdot \frac{r-3}{r+b-5}
                    = \frac{(b)_2 (r)_4}{(r+b)_6},

where (m)_j = m(m-1) \cdots (m-j+1). Each factor is the number
of balls of the appropriate color at the time divided by the
number of balls in the urn at the time. More generally, if we
have drawn a sequence of k reds and j blacks, then the
probability is

    C_{i_1, ..., i_n} = \frac{(b)_j (r)_k}{(r+b)_{j+k}}.

The probability of drawing a given sequence of reds and 
blacks depends only on the number of reds and blacks drawn. 
In other words P((X_1 = i_1) \cap (X_2 = i_2) \cap \dots \cap (X_n = i_n))
is the same if we permute the i_1, ..., i_n leaving the X_j's
alone (or equivalently if we permute the X_j's leaving the
i_j's alone). For example,

    P((X_1 = i_1) \cap (X_2 = i_2) \cap \dots \cap (X_n = i_n))
        = P((X_1 = i_2) \cap (X_2 = i_1) \cap \dots \cap (X_n = i_n)).

Since we can compute the individual distributions of the X_j's
from the joint distribution by taking marginals, we
immediately get that the X_j's are equidistributed. Moreover
the joint distribution of, say, X_1 and X_5 is the same as
that of X_1 and X_2:

    P((X_1 = i_1) \cap (X_5 = i_2)) = P((X_1 = i_1) \cap (X_2 = i_2))

and the latter is easy to compute. In general the joint
distribution of any k of the X_j's is the same as that of the
first k of them. All these facts follow from the fact that the
joint distribution of the X_j's is unchanged when we permute
the X_j's. This is the real reason that the choices in
sequential sampling are equidistributed.
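
The permutation invariance can be checked directly on a small
urn. The sketch below is illustrative only: the function name
and the coding of colors (1 = red, 0 = black, as for the X_i)
are choices made here. It multiplies the successive conditional
probabilities and confirms that every rearrangement of one
sequence gets the same probability.

```python
from fractions import Fraction
from itertools import permutations

def sequence_probability(colors, r, b):
    """Probability of drawing the given colors in order, without replacement."""
    prob = Fraction(1)
    for color in colors:
        count = r if color == 1 else b
        if count <= 0:
            return Fraction(0)  # no ball of the required color is left
        prob *= Fraction(count, r + b)
        if color == 1:
            r -= 1
        else:
            b -= 1
    return prob

sequence = (0, 1, 1, 0, 1, 1)
values = {sequence_probability(s, 5, 3) for s in permutations(sequence)}
print(values)  # a single value: the order of the colors does not matter
```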

Definition. Random variables (either integer or continuous)
X_1, ..., X_n are said to be exchangeable when their joint
distribution (or density) is a symmetric function.

An example of a set of exchangeable random variables we have
already seen is a set of independent, equidistributed random
variables. If X_1, X_2 and X_3 are independent, equidistributed
integer R.V.'s, then

    P((X_1 = i_1) \cap (X_2 = i_2) \cap (X_3 = i_3)) = P(X_1 = i_1) P(X_2 = i_2) P(X_3 = i_3)
                                                     = p_{i_1} p_{i_2} p_{i_3},

which is symmetric in i_1, i_2, i_3.

But as we have just seen, being exchangeable is not as strong 
a condition as being independent and equidistributed. 



We mention in passing that being exchangeable is not 
really that much more general than being independent and 
equidistributed. There is a deep theorem of probability 
theory which, roughly speaking, says that every set of ex- 
changeable random variables can be "synthesized" from inde- 
pendent equidistributed random variables by suitable con- 
ditioning. 

The Polya Urn Process 

A slightly more general sampling model than sampling either
with replacement or without replacement is called the Polya
Urn Process. In this process we begin with an urn containing
r red balls and b black balls. We draw a ball at random. If it
is red, we put the drawn ball plus c more red balls into the
urn. If it is black, we put the drawn ball plus d more black
balls into the urn. We then repeat this. Sampling with
replacement is the case c=d=0, and sampling without
replacement is the case c=d=-1.

This process was originally introduced as a model of
epidemics. If we think of the red balls as diseased
individuals, then each discovery of a red ball increases the
likelihood that other balls will be red (c>0). There are
obvious defects in such a model which we will not pursue. We
will just think of this process as a general form of sampling.

As before let X_1, ..., X_n be the successive results of
drawing n balls in the Polya Urn Process. The computation of
the joint distribution of the X_j's is much the same as
before. For example,

    C_{1,1,0,0,1} = P((X_1 = 1) \cap (X_2 = 1) \cap (X_3 = 0) \cap (X_4 = 0) \cap (X_5 = 1))
                  = \frac{r}{r+b} \cdot \frac{r+c}{r+b+c} \cdot \frac{b}{r+b+2c} \cdot \frac{b+d}{r+b+2c+d} \cdot \frac{r+2c}{r+b+2c+2d}.

In general, the X_j's will not be exchangeable, but if c=d,
they are. For those who like formulas, the probability of
drawing j blacks and k reds in any given order is

    \frac{(r/c)^{(k)} (b/c)^{(j)}}{((r+b)/c)^{(j+k)}}

provided that d=c \ne 0, where x^{(m)} = x(x+1) \cdots (x+m-1),
and is

    \frac{r^k b^j}{(r+b)^{j+k}}

if c=d=0.
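
The claim about exchangeability can be tested on small cases.
The Python sketch below is illustrative only; the function
names are assumptions, colors are coded 1 = red and 0 = black,
and c, d >= -1 is assumed so that ball counts never go
negative.

```python
from fractions import Fraction
from itertools import permutations

def polya_sequence_probability(colors, r, b, c, d):
    """Probability of drawing the given colors in order in the Polya Urn Process."""
    prob = Fraction(1)
    for color in colors:
        count = r if color == 1 else b
        if count <= 0 or r + b <= 0:
            return Fraction(0)
        prob *= Fraction(count, r + b)
        if color == 1:
            r += c  # the red ball goes back with c more red balls
        else:
            b += d  # the black ball goes back with d more black balls
    return prob

def same_for_all_orders(colors, r, b, c, d):
    """True when every rearrangement of colors has the same probability."""
    probs = {polya_sequence_probability(s, r, b, c, d) for s in permutations(colors)}
    return len(probs) == 1

print(same_for_all_orders((1, 1, 0, 0, 1), 2, 3, 2, 2))  # c = d
print(same_for_all_orders((1, 0), 2, 3, 1, 2))           # c != d
```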



4. The Arcsine Law of Random Walks

We use the same notation as in section III.1D. The arcsine law
is the distribution of the time of the last visit of a random
walk to the origin. More precisely, consider a random walk up
to time 2n, and ask when the last time was that the random
walk visited the origin. Let L_{2n} be the time of the last
visit. Clearly the random walk can only visit the origin
during even-numbered times. We want, therefore, to compute
P(L_{2n} = 2k) for all k between 0 and n.

Now examine the event (L_{2n} = 2k). We can rephrase this event
as saying that the random walk was at the origin at time 2k,
and that from then on the random walk never visited the
origin:

    (L_{2n} = 2k) = (S'_{2k} = 0) \cap (S'_{2k+1} \ne 0) \cap (S'_{2k+2} \ne 0) \cap \dots \cap (S'_{2n} \ne 0).

The law of successive conditioning tells us that

    P(L_{2n} = 2k) = P(L_{2n} = 2k | S'_{2k} = 0) P(S'_{2k} = 0).

Now P(L_{2n} = 2k | S'_{2k} = 0) is the same as P(L_{2n-2k} = 0).
This follows from the independence of the steps of the random
walk. We know how to compute P(S'_{2k} = 0) so we must find a
way to compute

    P(L_{2n-2k} = 0) = P((S'_1 \ne 0) \cap (S'_2 \ne 0) \cap \dots \cap (S'_{2n-2k} \ne 0)).

For this we use the law of alternatives, conditioning on which
way the walk went during the first step:

    P(L_{2n-2k} = 0) = P(L_{2n-2k} = 0 | X'_1 = +1) P(X'_1 = +1)
                     + P(L_{2n-2k} = 0 | X'_1 = -1) P(X'_1 = -1)
                     = \frac{1}{2} P(L_{2n-2k} = 0 | X'_1 = +1) + \frac{1}{2} P(L_{2n-2k} = 0 | X'_1 = -1).

By symmetry both of the above conditional probabilities are
the same. Thus

    P(L_{2n-2k} = 0) = P(L_{2n-2k} = 0 | X'_1 = -1).

If we now change coordinates we may consider the "walk" as
starting at (1,-1). We now see that we have a familiar
situation. P(L_{2n-2k} = 0 | X'_1 = -1) is the probability that
the random walk travels no farther to the right than the
origin in the first 2n-2k-1 steps, i.e. P(M_{2n-2k-1} = 0),
where M_n is the maximum position of the random walk in the
first n steps (see section IV.2a). Thus

    P(L_{2n-2k} = 0) = P(M_{2n-2k-1} = 0) = p(2n-2k-1, 0) + p(2n-2k-1, 1).

It is easy to see that p(2n-2k-1, 0) = 0. Thus

    P(L_{2n-2k} = 0) = p(2n-2k-1, 1)
                     = \binom{2n-2k-1}{n-k} \frac{1}{2^{2n-2k-1}}
                     = \frac{(2n-2k-1)!}{(n-k)! (n-k-1)!} \cdot \frac{1}{2^{2n-2k-1}}
                     = \frac{(2n-2k)!}{(n-k)! (n-k)!} \cdot \frac{n-k}{2n-2k} \cdot \frac{1}{2^{2n-2k-1}}
                     = \binom{2n-2k}{n-k} \frac{1}{2^{2n-2k}}
                     = p(2n-2k, 0).

Returning to our original problem, we find that

    P(L_{2n} = 2k) = P(L_{2n-2k} = 0) P(S'_{2k} = 0)
                   = p(2n-2k, 0) p(2k, 0).

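We now show why this is called the arcsine law. Using
Stirling's formula, we find that

    p(2n-2k, 0) p(2k, 0)
      = \binom{2n-2k}{n-k} \binom{2k}{k} \frac{1}{2^{2n}}
      = \frac{(2n-2k)! (2k)!}{((n-k)!)^2 (k!)^2} \cdot \frac{1}{2^{2n}}
      \approx \frac{(2n-2k)^{2n-2k} \sqrt{2\pi(2n-2k)} (2k)^{2k} \sqrt{2\pi(2k)}}{(n-k)^{2n-2k} 2\pi(n-k) k^{2k} 2\pi k} \cdot \frac{1}{2^{2n}}
      = \frac{2^{2n-2k} \cdot 2\sqrt{\pi(n-k)} \cdot 2^{2k} \cdot 2\sqrt{\pi k}}{2\pi(n-k) \cdot 2\pi k \cdot 2^{2n}}
      = \frac{1}{\pi \sqrt{k(n-k)}}.

Hence

    P(L_{2n} = 2k) \approx \frac{1}{\pi \sqrt{k(n-k)}}.

Set x = k/n. Then

    P(L_{2n} = 2k) \approx \frac{1}{\pi n \sqrt{x(1-x)}}.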

Thus when n is large the distribution function of L_{2n},
P(L_{2n} \le 2k), is approximately equal to the area from 0 to
k/n under the function f(x) = \frac{1}{\pi \sqrt{x(1-x)}}:

    P(L_{2n} \le 2k) \approx \int_0^{k/n} \frac{dx}{\pi \sqrt{x(1-x)}}.

Using the substitution y = \sqrt{x}, we find that

    \int_0^{k/n} \frac{dx}{\pi \sqrt{x(1-x)}} = \frac{2}{\pi} \arcsin(\sqrt{k/n}).

Thus

    P(L_{2n} \le 2k) \approx \frac{2}{\pi} \arcsin(\sqrt{k/n}).

Summarizing,

    Let L_{2n} be the time of the last visit of a 2n-step random
    walk to the origin. Then

        P(L_{2n} = 2k) = p(2n-2k, 0) p(2k, 0) \approx \frac{1}{\pi \sqrt{k(n-k)}},

    and P(L_{2n} \le 2k) \approx \frac{2}{\pi} \arcsin(\sqrt{k/n}).

                        The Arcsine Law
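
The exact distribution and its arcsine approximation are easy
to compare numerically. The Python sketch below is
illustrative only; the function names are assumptions, and
p(2m, 0) is computed from the binomial formula used above.

```python
from math import asin, comb, pi, sqrt

def p_zero(m):
    """p(2m, 0): probability that a fair random walk is at the origin at time 2m."""
    return comb(2 * m, m) / 4 ** m

def last_visit_pmf(n, k):
    """Exact P(L_2n = 2k) = p(2n-2k, 0) p(2k, 0)."""
    return p_zero(n - k) * p_zero(k)

def last_visit_cdf(n, k):
    """Exact P(L_2n <= 2k)."""
    return sum(last_visit_pmf(n, j) for j in range(k + 1))

def arcsine_cdf(n, k):
    """The approximation (2/pi) arcsin(sqrt(k/n))."""
    return (2 / pi) * asin(sqrt(k / n))

n = 500
for k in (50, 250, 450):
    print(k, last_visit_cdf(n, k), arcsine_cdf(n, k))
```

Note that the exact probabilities are symmetric in k and n-k, so
a last visit very early in the walk is just as likely as one
very late.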

Here is an example of this law. A gambler plays a fair game,
betting one dollar every ten seconds on the toss of a fair
coin. If the gambler plays for a whole year, what is the
probability that the last time the gambler "broke even"
occurred within the first day of play (i.e. the gambler had
either a "winning streak" or a "losing streak" for 364 days)?
The arcsine law provides an excellent approximation of this
probability:

    P(L_{3153600} \le 8640) \approx .0333,

i.e. about one chance in 30. This is amazingly large. One can
analyze the fluctuations of coin tossing in even more detail.
The surprising conclusion is that it is very unlikely for a
random walk to spend close to the same amount of time on both
sides of the origin. Thus while the average value of S'_n is
zero for all n, nevertheless individual random walks with high
probability will exhibit behavior that a naive observer would
regard as being very unrandom.
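
The numbers in the gambler example can be reproduced in a few
lines. This Python sketch is illustrative only (the variable
names are assumptions); it counts the tosses at one every ten
seconds and evaluates the arcsine approximation.

```python
from math import asin, pi, sqrt

tosses_per_day = 6 * 60 * 24             # one toss every ten seconds
tosses_per_year = 365 * tosses_per_day   # 2n in the notation of the text

def arcsine_cdf(fraction):
    """Approximate P(L_2n <= 2k) when k/n equals the given fraction."""
    return (2 / pi) * asin(sqrt(fraction))

probability = arcsine_cdf(tosses_per_day / tosses_per_year)
print(tosses_per_day, tosses_per_year, round(probability, 4))
```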

5. Continuous Conditional Probability 

Consider the Uniform process of sampling n points from the
interval [0,a]. Suppose we know that the kth point in order,
X_{(k)}, was t. Given this information, what is the smallest
point, X_{(1)}? It seems reasonable to answer this question
with the conditional probability distribution

    F(x) = P(X_{(1)} \le x | X_{(k)} = t)

of X_{(1)} given that X_{(k)} = t. Unfortunately, we know that
P(X_{(k)} = t) = 0; so that, technically speaking, the above
conditional probability does not make sense.

On the other hand, it is easy to compute what
P(X_{(1)} \le x | X_{(k)} = t) ought to mean. For suppose that
X_{(k)} = t. This means that exactly k-1 points have fallen in
the interval [0,t]. The random variable X_{(1)} should
therefore be reinterpreted as the first order statistic of
the Uniform process of dropping k-1 points in the interval
[0,t]. Therefore

    P(X_{(1)} \le x | X_{(k)} = t) = 1 - \left(\frac{t-x}{t}\right)^{k-1}.

Notice that we do not have to "choose" the k-1 points which
are to fall in [0,t]. This choice is already implicit in the
fact that we have conditioned by (X_{(k)} = t).

Although we cannot make sense, in general, of a conditional
probability P(A|B) when P(B) = 0, we can do so when B is the
event (X=t) for a continuous random variable X. We will call
this the continuous conditional probability (although we shall
often drop the adjective "continuous"). The following is the
formal definition of this concept. But one rarely uses the
definition directly. As with the ordinary conditional
probability, the best way to compute a continuous conditional
probability is to regard the condition as defining a new
sample space and to reinterpret the events and random
variables of the old sample space in this new sample space.

Definition. For an event A and a continuous random variable X,
the continuous conditional probability of A given that X = t
is

    P(A | X=t) = \lim_{\epsilon \to 0} P(A | t \le X \le t+\epsilon)
               = \lim_{\epsilon \to 0} \frac{P(A \cap (t \le X \le t+\epsilon))}{P(t \le X \le t+\epsilon)}

provided that it exists.

Notice that we do not divide by \epsilon. The reason is that
\epsilon appears in both the numerator and the denominator. If
you wish, P(A | X=t) is the limit:

    \lim_{\epsilon \to 0} \frac{P(A \cap (t \le X \le t+\epsilon)) / \epsilon}{P(t \le X \le t+\epsilon) / \epsilon}.

Both the numerator and the denominator in this limit have
"densities" as their limits.

Just to make sure, we will compute P(X_{(1)} \le x | X_{(k)} = t)
directly from the definition to see that we get what we
computed earlier. We know from our computation in section
IV.6 that

    P(t \le X_{(k)} \le t+\epsilon)
      = n \binom{n-1}{k-1} \frac{t^{k-1} \epsilon (a-t-\epsilon)^{n-k}}{a^n} + \epsilon^2 \cdot (complicated expression).

Next we compute P((X_{(1)} > x) \cap (t \le X_{(k)} \le t+\epsilon)).
Except for a term having a factor of \epsilon^2, this event
corresponds to having k-1 points in the interval [x,t], one
point in [t,t+\epsilon] and the rest in [t+\epsilon,a].
Therefore,

    P((X_{(1)} > x) \cap (t \le X_{(k)} \le t+\epsilon))
      = n \binom{n-1}{k-1} \frac{(t-x)^{k-1} \epsilon (a-t-\epsilon)^{n-k}}{a^n} + \epsilon^2 \cdot (complicated expression).

We now combine the above two computations.

    \frac{P((X_{(1)} > x) \cap (t \le X_{(k)} \le t+\epsilon))}{P(t \le X_{(k)} \le t+\epsilon)}
      = \frac{n \binom{n-1}{k-1} (t-x)^{k-1} \epsilon (a-t-\epsilon)^{n-k} a^{-n} + \epsilon^2 \cdot (expression)}{n \binom{n-1}{k-1} t^{k-1} \epsilon (a-t-\epsilon)^{n-k} a^{-n} + \epsilon^2 \cdot (expression)}
      \to \left(\frac{t-x}{t}\right)^{k-1}   as \epsilon \to 0.

Finally we get P(X_{(1)} \le x | X_{(k)} = t) = 1 - \left(\frac{t-x}{t}\right)^{k-1}
as before.

Needless to say this is the hard way to compute this.

Consider one more example. Suppose we know that the first
point, X_1, is t. Given this, what is the smallest point,
X_{(1)}? Again the answer is a probability distribution

    F(x) = P(X_{(1)} \le x | X_1 = t).

We split this into two cases.

Case 1: x < t.

By the independence of the X_i's in the Uniform process,
P(X_{(1)} \le x | X_1 = t) should be interpreted as
P(X_{(1)} \le x) but in the Uniform process of sampling n-1
points from [0,a]. That is, knowing that X_1 is t does not
influence whether any other points are smaller than x.
Therefore,

    P(X_{(1)} \le x | X_1 = t) = 1 - \left(\frac{a-x}{a}\right)^{n-1}.

Case 2: x \ge t.

Here the fact that X_1 = t means that (X_{(1)} \le x) has
occurred. Therefore, P(X_{(1)} \le x | X_1 = t) = 1.

    Figure: The Conditional Distribution F(x) = P(X_{(1)} \le x | X_1 = t)



Combining these two cases, we find that
P(X_{(1)} \le x | X_1 = t) is not a continuous function. When
we condition by (X_1 = t), the random variable X_{(1)} becomes
discontinuous. This will often be the case for conditional
distributions. Later we will develop techniques for using
discontinuous random variables as if they were continuous.

The Continuous Law of Alternatives

One of the most important facts about continuous conditioning
is that the law of alternatives has a continuous version.
Indeed continuous conditional probabilities are important
primarily because of this. Recall that if a set of events
A_1, A_2, ... form a set of alternatives, then the probability
of any event B is

    P(B) = \sum_i P(B | A_i) P(A_i).

For continuous conditional probabilities, we replace the
alternatives A_i by the "alternatives" (X=t), where t takes on
all real values, and we replace the sum by an integral.

    For any continuous random variable X and event A for which
    the continuous conditional probabilities P(A | X=t) exist,

        P(A) = \int_{-\infty}^{\infty} P(A | X=t) dens(X=t) dt.

                Continuous Law of Alternatives



We will give a rigorous proof of this law. The key fact we
need is the Mean Value Theorem of Calculus. Recall what this
says. If f is a continuous function on the interval [a,b],
then for some point x between a and b,

    f(x) = \frac{1}{b-a} \int_a^b f(s) ds.



Proof of the Continuous Law of Alternatives

Let \epsilon > 0 be a small number. Divide up the real line
into intervals of length \epsilon by the points t_n = n\epsilon:

    ..., t_{-2}, t_{-1}, t_0, t_1, t_2, t_3, ...

Take B_i to be the event (t_i \le X \le t_{i+1}). Then the B_i
form a set of alternatives. By the (ordinary) law of
alternatives,

    P(A) = \sum_i P(A | B_i) P(B_i)
         = \sum_i P(A | t_i \le X \le t_i + \epsilon) P(t_i \le X \le t_i + \epsilon).

By the mean value theorem applied to f(t) = dens(X=t), there
is some \bar{t}_i in the interval [t_i, t_{i+1}] such that

    f(\bar{t}_i) = \frac{1}{\epsilon} P(t_i \le X \le t_i + \epsilon),

or P(t_i \le X \le t_i + \epsilon) = f(\bar{t}_i) \Delta t_i, where
\Delta t_i = t_{i+1} - t_i = \epsilon. Therefore

    P(A) = \sum_i P(A | t_i \le X \le t_i + \epsilon) f(\bar{t}_i) \Delta t_i.

Now as \epsilon \to 0 this last sum approaches

    \int_{-\infty}^{\infty} P(A | X=t) f(t) dt,

by the definition of the integral. We have therefore proved
the continuous law of alternatives.
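
The law can be tried out on the example of the previous
section, where P(X_{(1)} \le x | X_1 = t) was found in two cases.
Integrating against the uniform density of X_1 should recover
the unconditional distribution of the smallest of n points,
1 - ((a-x)/a)^n. The Python sketch below is illustrative only:
the function names and the midpoint rule are choices made
here, and 0 <= x <= a is assumed.

```python
def conditional_cdf_min(x, t, n, a):
    """P(X_(1) <= x | X_1 = t) for n uniform points on [0, a]."""
    if x >= t:
        return 1.0                          # Case 2: X_1 itself is at most x
    return 1.0 - ((a - x) / a) ** (n - 1)   # Case 1: the other n-1 points decide

def cdf_min_by_alternatives(x, n, a, steps=1000):
    """Midpoint-rule value of the integral of P(X_(1) <= x | X_1 = t) dens(X_1 = t) dt."""
    width = a / steps
    density = 1.0 / a                       # X_1 is uniform on [0, a]
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width
        total += conditional_cdf_min(x, t, n, a) * density * width
    return total

def cdf_min_direct(x, n, a):
    """P(X_(1) <= x) computed directly."""
    return 1.0 - ((a - x) / a) ** n

print(cdf_min_by_alternatives(0.3, 4, 1.0), cdf_min_direct(0.3, 4, 1.0))
```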

Notice that we used the fact that the density of X,
f(t) = dens(X=t), is continuous. Actually, in practice it will
only be piecewise continuous. This technical detail will never
bother us. The continuous law of alternatives holds in this
case also.

6. Conditional Densities

In most computations concerning continuous random variables,
the densities are much easier to handle. We can give density
versions of conditional probability, continuous conditional
probability and the continuous law of alternatives.

Let us begin with the simplest case: conditional density.
Suppose we have an event B such that P(B)>0 and a random
variable Y. The distribution of Y given that B has occurred is

    F(s) = P(Y \le s | B).

In general it is possible that a continuous random variable
can fail to be continuous after conditioning, as we saw in the
previous section. But if it is still continuous, we may speak
of the conditional density of Y given B:

    dens(Y=s | B) = F'(s) = \frac{d}{ds} P(Y \le s | B).

If the event B is of the form (X=t), then the conditional
density can be defined by a limiting process just as we did in
the last section. More precisely, the continuous conditional
density of Y given X = t is

    dens(Y=s | X=t) = \lim_{\epsilon \to 0} dens(Y=s | t \le X \le t+\epsilon),

if this limit exists. If dens(X=t) \ne 0, then

    dens(Y=s | X=t) = \frac{dens((Y=s) \cap (X=t))}{dens(X=t)}

exactly as one would expect.

Consider again the example of section 5. The conditional
density dens(X_{(1)} = x | X_{(k)} = t), when x < t, is the same
as the density dens(X_{(1)} = x) but for the process of
dropping k-1 points on [0,t], i.e.

    dens(X_{(1)} = x | X_{(k)} = t) = \frac{(k-1)(t-x)^{k-2}}{t^{k-1}}.

We could also compute this as follows:

    dens(X_{(1)} = x | X_{(k)} = t) = \frac{dens((X_{(1)} = x) \cap (X_{(k)} = t))}{dens(X_{(k)} = t)}.

The two densities on the right were computed in section III.7:

    dens(X_{(k)} = t) = n \binom{n-1}{k-1} \frac{t^{k-1} (a-t)^{n-k}}{a^n}

    dens((X_{(1)} = x) \cap (X_{(k)} = t)) = \binom{n}{0,1,k-2,1,n-k} \frac{(t-x)^{k-2} (a-t)^{n-k}}{a^n}.

Therefore,

    dens(X_{(1)} = x | X_{(k)} = t)
      = \frac{\binom{n}{0,1,k-2,1,n-k} (t-x)^{k-2} (a-t)^{n-k}}{n \binom{n-1}{k-1} t^{k-1} (a-t)^{n-k}}
      = \frac{(k-1)(t-x)^{k-2}}{t^{k-1}}.
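
The cancellation in the last step can be confirmed with exact
arithmetic. The Python sketch below is illustrative only (the
function names are assumptions); it evaluates the two densities
for particular values with 0 <= x < t <= a and k >= 2, and
compares their ratio with the closed form.

```python
from fractions import Fraction
from math import comb, factorial

def density_kth(t, k, n, a):
    """dens(X_(k) = t) for n uniform points on [0, a]."""
    return n * comb(n - 1, k - 1) * t ** (k - 1) * (a - t) ** (n - k) / a ** n

def joint_density_first_and_kth(x, t, k, n, a):
    """dens((X_(1) = x) and (X_(k) = t)), using the multinomial coefficient."""
    multinomial = Fraction(factorial(n), factorial(k - 2) * factorial(n - k))
    return multinomial * (t - x) ** (k - 2) * (a - t) ** (n - k) / a ** n

def conditional_density(x, t, k, n, a):
    """dens(X_(1) = x | X_(k) = t) as a ratio of the two densities."""
    return joint_density_first_and_kth(x, t, k, n, a) / density_kth(t, k, n, a)

def closed_form(x, t, k):
    """(k-1)(t-x)^(k-2) / t^(k-1)."""
    return (k - 1) * (t - x) ** (k - 2) / t ** (k - 1)

x, t, a = Fraction(1, 4), Fraction(1, 2), Fraction(1)
print(conditional_density(x, t, 3, 5, a) == closed_form(x, t, 3))
```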



The Continuous Bayes' Law

By using conditional densities one can formulate a continuous
version of Bayes' law. Suppose we have two random variables X
and Y. We call X the "cause" and Y the "effect". For example X
may represent a parameter in an experiment, which we cannot
measure directly, while Y is some directly measurable
quantity. We want to determine the effect on the distribution
of X of a particular observation of Y. As in the discrete
Bayes' law, we assume that we know the a priori distribution
of X, dens(X=x), as well as the conditional densities of Y
given a value of X, dens(Y=y | X=x). By a calculation almost
identical to the one for the discrete Bayes' law, we have this
formula:

    dens(X=x | Y=y) = \frac{dens(Y=y | X=x) dens(X=x)}{\int_{-\infty}^{\infty} dens(Y=y | X=t) dens(X=t) dt}

                    Continuous Bayes' Law

Continuous Law of Successive Conditioning 

In a manner similar to that above, one can state a continuous
analog of the law of successive conditioning. We leave the
details as an exercise.

    If X_1, X_2, ..., X_n are a sequence of continuous random
    variables, then their joint density is given by:

        dens(X_1 = t_1, X_2 = t_2, ..., X_n = t_n)
          = dens(X_1 = t_1) dens(X_2 = t_2 | X_1 = t_1) \cdots dens(X_n = t_n | X_1 = t_1, ..., X_{n-1} = t_{n-1})

            Continuous Law of Successive Conditioning

7. Gaps in the Uniform Process

As an application and illustration of the conditioning 
techniques just introduced, we give a detailed and rigorous 
treatment of the gaps in the Uniform process. We begin 
with a problem posed in the introduction. Namely, if we drop 
a set of needles, each of length h, on a stick of length b, 
what is the probability that none of the needles overlap? 

Needles on a Stick 

We first restate the problem in terms of the Uniform process.
The position of a given needle is completely determined by its
left endpoint. The process of dropping n needles of length h
on a stick of length b is then the same as dropping n points
on the interval [0,b-h]. Write a=b-h.

    Figure: 5 needles of length h on a stick of length b
            = 5 points on an interval of length a = b-h

Now two needles are non-overlapping if and only if their left
endpoints are at least distance h apart. Let A_{a,n} be the
event "n needles on a stick of length b = a+h do not overlap".
Then

    P(A_{a,n}) = P((L_2 \ge h) \cap (L_3 \ge h) \cap \dots \cap (L_n \ge h))
               = P(\min_{2 \le i \le n} L_i \ge h).

We will first compute the probability of a slightly different
event. Let B_{a,n} be the event "n points dropped on [0,a] are
all at least distance h from each other and from the right
endpoint." This is exactly the same as A_{a,n} but we have
added the condition that the last gap, L_{n+1}, also be at
least h, i.e.

    A_{a,n} = (L_2 \ge h) \cap (L_3 \ge h) \cap \dots \cap (L_n \ge h)
    B_{a,n} = (L_2 \ge h) \cap (L_3 \ge h) \cap \dots \cap (L_n \ge h) \cap (L_{n+1} \ge h).

To compute P(B_{a,n}) we condition on the position of the
largest point:

    P(B_{a,n}) = \int_{-\infty}^{\infty} P(B_{a,n} | X_{(n)} = t) dens(X_{(n)} = t) dt
               = \int_{(n-1)h}^{a-h} P(B_{a,n} | X_{(n)} = t) \frac{n}{a^n} t^{n-1} dt.

Here the limits of integration stem from the fact that B_{a,n}
can only occur if the largest point falls so that the
rightmost gap, L_{n+1}, is at least h and so that there is
enough "room" for n-1 gaps of size h to the left of X_{(n)}.
Now P(B_{a,n} | X_{(n)} = t) is the same as the probability of
dropping n-1 points on [0,t] so that all gaps are at least h
and also so that the largest point, X_{(n-1)}, falls at least
distance h from t. Thus

    P(B_{a,n} | X_{(n)} = t) = P(B_{t,n-1}).

We may therefore use mathematical induction. For if we write
p_n(a) = P(B_{a,n}), then

    p_n(a) = \int_{(n-1)h}^{a-h} p_{n-1}(t) \frac{n}{a^n} t^{n-1} dt.

Consider p_1(a). This is the probability that a point dropped
on [0,a] falls farther than distance h from a. So
p_1(a) = \frac{a-h}{a}. More generally the above inductive
formula can be used to deduce that

    p_n(a) = \left(\frac{a-nh}{a}\right)^n.

To compute P(A_{a,n}) from what we know about P(B_{a,n}), we
condition on the largest point X_{(n)}. It is easy to see that
P(A_{a,n} | X_{(n)} = t) = P(B_{t,n-1}). Therefore, as above,

    P(A_{a,n}) = \int_{-\infty}^{\infty} P(A_{a,n} | X_{(n)} = t) dens(X_{(n)} = t) dt
               = \int_{(n-1)h}^{a} P(B_{t,n-1}) \frac{n}{a^n} t^{n-1} dt
               = \frac{n}{a^n} \int_{(n-1)h}^{a} \left(\frac{t-(n-1)h}{t}\right)^{n-1} t^{n-1} dt
               = \frac{n}{a^n} \left[\frac{(t-(n-1)h)^n}{n}\right]_{(n-1)h}^{a}
               = \frac{(a-(n-1)h)^n}{a^n}
               = \left(\frac{b-nh}{b-h}\right)^n

(provided b > nh)
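
The needle formula is simple to evaluate and to compare with a
simulation. The Python sketch below is illustrative only: the
function names, the number of trials and the fixed seed are
assumptions made here, and the simulation only gives an
estimate of the probability.

```python
import random

def prob_no_overlap(n, h, b):
    """P(n needles of length h on a stick of length b do not overlap)."""
    if b <= n * h:
        return 0.0  # not enough room for n disjoint needles
    a = b - h
    return ((a - (n - 1) * h) / a) ** n

def simulate_no_overlap(n, h, b, trials=20000, seed=0):
    """Estimate the same probability by dropping left endpoints on [0, b-h]."""
    rng = random.Random(seed)
    a = b - h
    hits = 0
    for _ in range(trials):
        points = sorted(rng.uniform(0, a) for _ in range(n))
        # Needles are disjoint when consecutive left endpoints are at least h apart.
        if all(right - left >= h for left, right in zip(points, points[1:])):
            hits += 1
    return hits / trials

print(prob_no_overlap(3, 0.1, 1.1), simulate_no_overlap(3, 0.1, 1.1))
```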



Exchangeability of the gaps 

Recall that we stated in section III.3 that the gaps in the
Uniform process are equidistributed but that we gave only an
intuitive justification. We can now give a rigorous proof
using conditional density.

Consider the first two gaps. The density of L_1 is
dens(L_1 = t_1) = \frac{n (a-t_1)^{n-1}}{a^n}. Therefore, by the
law of alternatives,

    dens(L_2 = t_2) = \int_{-\infty}^{\infty} dens(L_2 = t_2 | L_1 = t_1) dens(L_1 = t_1) dt_1.

Now the conditional density dens(L_2 = t_2 | L_1 = t_1) is the
same as that of the first gap in the Uniform process of
dropping n-1 points on the interval [t_1,a]. Therefore

    dens(L_2 = t_2 | L_1 = t_1) = \frac{(n-1)(a-t_1-t_2)^{n-2}}{(a-t_1)^{n-1}}   if 0 \le t_2 \le a-t_1,
                                = 0                                                 otherwise.

Hence:



dens (L 2 =t 2 ) - 



o _+- . n— 2 / . x n — 1 

a H (n-1) (a-t n -t ) n * n(a-t 1 ) 







(a-t 1 ) 



1 2 
n-1 



n 



dt. 



3 * 2 (n-l)(n) (a _ t t } n-2 d 
a n 12 1 



(n-1) (n) 
n 



n-1 



n-1 



a-t. 



J 



^ (a-t,) 11 " 1 
n 2 

a 



Therefore 1^ and L 2 are equidistributed . 
We end with an example of the use of the law of successive
conditioning. We compute the joint density of the
gaps $L_1, \ldots, L_{n+1}$.

$$\mathrm{dens}(L_1=t_1, L_2=t_2, \ldots, L_{n+1}=t_{n+1})
 = \mathrm{dens}(L_1=t_1)\cdot\mathrm{dens}(L_2=t_2 \mid L_1=t_1)\cdot\mathrm{dens}(L_3=t_3 \mid L_1=t_1, L_2=t_2)\cdots$$

The conditional density

$$\mathrm{dens}(L_j=t_j \mid L_1=t_1, L_2=t_2, \ldots, L_{j-1}=t_{j-1})$$

is the same as the density of the first gap in the Uniform
process of dropping n-j+1 points on $[t_1+\cdots+t_{j-1},\,a]$:

$$\frac{(n-j+1)\,(a-t_1-t_2-\cdots-t_j)^{n-j}}{(a-t_1-t_2-\cdots-t_{j-1})^{n-j+1}}$$

Therefore

$$\mathrm{dens}(L_1=t_1, L_2=t_2, \ldots, L_{n+1}=t_{n+1}) =
\begin{cases}
\dfrac{n!}{a^n} & \text{if } t_1+\cdots+t_{n+1} = a \\[1ex]
0 & \text{otherwise}
\end{cases}$$

by cancellation.
The joint density of any collection of gaps can be
computed from the above formula by taking marginals with
respect to all the other gaps. As a result we see that all
the gaps are equidistributed. Even more is true: the gaps
are exchangeable! At first this does not seem correct, but
it is possible to see it intuitively if we return to the
"points on a circle" interpretation of the Uniform process
as in section III.3.

An immediate, and by no means obvious, consequence
of the exchangeability of the gaps is that the covariance
of any pair of them is the same as that of the first two.
This implies paradoxically that the correlation between
the first two gaps is the same as that between $L_i$ and
$L_j$ for any $i \ne j$.
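The claim about covariances is easy to test empirically. The following sketch, whose function names, sample size and seed are illustrative assumptions, estimates the covariance of the first two gaps and of the first and last gaps; by exchangeability the two estimates should agree up to sampling error.

```python
import random


def gaps(n, a, rng):
    """Drop n uniform points on [0, a] and return the n+1 gaps in order."""
    points = sorted(rng.uniform(0.0, a) for _ in range(n))
    edges = [0.0] + points + [a]
    return [right - left for left, right in zip(edges, edges[1:])]


def gap_covariance(i, j, n, a, trials=50000, seed=1):
    """Monte Carlo estimate of Cov(L_i, L_j), with gaps numbered from 1."""
    rng = random.Random(seed)
    sum_i = sum_j = sum_ij = 0.0
    for _ in range(trials):
        g = gaps(n, a, rng)
        sum_i += g[i - 1]
        sum_j += g[j - 1]
        sum_ij += g[i - 1] * g[j - 1]
    mean_i = sum_i / trials
    mean_j = sum_j / trials
    return sum_ij / trials - mean_i * mean_j


if __name__ == "__main__":
    n, a = 4, 1.0
    print("Cov(L1, L2):", gap_covariance(1, 2, n, a))
    print("Cov(L1, L5):", gap_covariance(1, n + 1, n, a))
```

Both estimates come out negative, as one would expect: the gaps must add up to a, so a large gap leaves less room for the others.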



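Table of Conditioning Laws

Types of Conditioning

  Conditioning
    Probabilities:  $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
    Densities:      $\mathrm{dens}(X=t \mid B) = \dfrac{d}{dt}\,P(X \le t \mid B)$

  Continuous Conditioning
    Probabilities:  $P(A \mid Y=s) = \lim_{\varepsilon \to 0} P(A \mid s \le Y \le s+\varepsilon)$
    Densities:      $\mathrm{dens}(X=t \mid Y=s) = \lim_{\varepsilon \to 0} \dfrac{d}{dt}\,P(X \le t \mid s \le Y \le s+\varepsilon)$

Law of Alternatives

  Conditioning
    Probabilities:  $P(B) = \sum_i P(B \mid A_i)\,P(A_i)$
    Densities:      $\mathrm{dens}(Y=s) = \sum_i \mathrm{dens}(Y=s \mid A_i)\,P(A_i)$

  Continuous Conditioning
    Probabilities:  $P(B) = \int_{-\infty}^{\infty} P(B \mid X=t)\,\mathrm{dens}(X=t)\,dt$
    Densities:      $\mathrm{dens}(Y=s) = \int_{-\infty}^{\infty} \mathrm{dens}(Y=s \mid X=t)\,\mathrm{dens}(X=t)\,dt$

Bayes' Law

    Probabilities:  $P(A_i \mid B) = \dfrac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)}$
    Densities:      $\mathrm{dens}(X=x \mid Y=y) = \dfrac{\mathrm{dens}(Y=y \mid X=x)\,\mathrm{dens}(X=x)}{\int_{-\infty}^{\infty} \mathrm{dens}(Y=y \mid X=t)\,\mathrm{dens}(X=t)\,dt}$

Law of Successive Conditioning

    Probabilities:  $P(B_1 \cap B_2 \cap \cdots \cap B_n) = P(B_1)\,P(B_2 \mid B_1)\,P(B_3 \mid B_1 \cap B_2) \cdots P(B_n \mid B_1 \cap B_2 \cap \cdots \cap B_{n-1})$
    Densities:      $\mathrm{dens}(X_1=t_1, X_2=t_2, \ldots, X_n=t_n) = \mathrm{dens}(X_1=t_1)\,\mathrm{dens}(X_2=t_2 \mid X_1=t_1) \cdots \mathrm{dens}(X_n=t_n \mid X_1=t_1, \ldots, X_{n-1}=t_{n-1})$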
8. The Algebra of Probability Distributions

Very early in our study of random variables we noted that
we can perform algebraic operations on them to get new random
variables. We did not, however, make any systematic study of
what effect an algebraic operation has on the distribution of
the random variables involved. For example, if X is a
continuous random variable with density f(x), what is the
density of 2X? The answer is most assuredly not 2f(x). In
fact, the correct answer is $\frac{1}{2} f(\frac{x}{2})$. This illustrates a basic
fact about algebraic operations on random variables: the
effect of a simple operation on a random variable is seldom
reflected in a simple way on its density. In this section we
consider two kinds of operations on random variables: "change
of variables" on a single random variable and the sum of two
independent random variables.

Change of Variables 

Let X be a continuous random variable, whose density is 
f(x). Let g(x) be an increasing function. We wish to deter- 
mine the distribution of the random variable g(X). To do so 
we must consider the distribution function of X, not just its 
density. Accordingly, let F(x) be $P(X \le x)$, so that
f(x) = F'(x). The distribution function of g(X) is given by
$P(g(X) \le x)$. Now we assumed that g(x) was an increasing
function, so it has an inverse function G(y) which is also
increasing. Therefore,

$$P(g(X) \le x) = P(G(g(X)) \le G(x)) = P(X \le G(x)) = F(G(x)).$$
To get the density of g(X) we differentiate this, using the
chain rule:

$$\mathrm{dens}(g(X)=x) = F'(G(x))\,G'(x) = f(G(x))\,G'(x).$$

By the inverse function principle of Calculus, $G'(x) = 1/g'(G(x))$.
Therefore, we have shown:

$$\mathrm{dens}(g(X)=x) = \frac{f(G(x))}{g'(G(x))}$$

Change of Variables Formula
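As a numerical sanity check of the formula, the sketch below compares it with a direct numerical derivative of F(G(y)). The particular choices are assumptions made only for illustration: X has density $e^{-x}$ on x > 0 and $g(x) = x^2$, which is increasing on the positive reals; the function names are ours.

```python
import math


def density_by_formula(f, g_inverse, g_prime, y):
    """Change of variables formula: dens(g(X)=y) = f(G(y)) / g'(G(y))."""
    x = g_inverse(y)
    return f(x) / g_prime(x)


def density_by_differentiation(F, g_inverse, y, step=1e-6):
    """Central-difference derivative of the distribution function F(G(y))."""
    upper = F(g_inverse(y + step))
    lower = F(g_inverse(y - step))
    return (upper - lower) / (2.0 * step)


# Illustrative assumption: X has density exp(-x) on x > 0 and g(x) = x**2.
def f(x):
    return math.exp(-x)


def F(x):
    return 1.0 - math.exp(-x)


def g_inverse(y):
    return math.sqrt(y)


def g_prime(x):
    return 2.0 * x


if __name__ == "__main__":
    for y in (0.5, 2.0, 5.0):
        print(y,
              density_by_formula(f, g_inverse, g_prime, y),
              density_by_differentiation(F, g_inverse, y))
```

The two columns of output should agree to many decimal places, since both compute the derivative of F(G(y)).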



An immediate consequence of this formula is that for any
continuous random variable X with distribution function F(x),
F(X) is uniformly distributed on [0,1]. Thus every continuous
random variable is, by a change of variables, expressible in
terms of any other. This fact can be used in computer
simulations of stochastic processes. Most computer systems provide
a pseudo-random number generator which produces, with each call,
an independent, uniformly distributed pseudo-random number from
[0,1]. Call this number RND. If we wish to simulate a
random variable X whose distribution function is F(x), we
just use G(RND), where G(y) is the inverse function of
F(x).
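The recipe G(RND) can be sketched in a few lines. As an assumed example we take the distribution function $F(x) = 1 - e^{-\lambda x}$, whose inverse is $G(y) = -\ln(1-y)/\lambda$; the function names and the seed are our own, and Python's `random.random()` plays the role of RND.

```python
import math
import random


def sample_exponential(rate, rnd):
    """Apply G(y) = -ln(1 - y) / rate, the inverse of F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - rnd) / rate


def simulate(rate, count, seed=0):
    """Return `count` simulated values of X, one per call to the generator."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - rnd is never zero.
    return [sample_exponential(rate, rng.random()) for _ in range(count)]


if __name__ == "__main__":
    values = simulate(rate=2.0, count=20000)
    print("sample mean:", sum(values) / len(values))
```

For a rate of 2 the sample mean should be close to 1/2, the expectation of this distribution.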

The change of variables formula we have given applies 
only to an increasing function g(x). For a decreasing 
function, the only change is that the sign of the right-hand 
side must be reversed. For more complicated functions g(x), 
one must partition the domain of g(x) into intervals on which 
it is increasing or decreasing and apply the change of variables 
formula to each such interval. The results must then be
combined to get the density of g(X). Needless to say this 
can get quite intricate. 
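A small worked case may make the procedure concrete. In the sketch below we assume, purely for illustration, that X is standard normal and $g(x) = x^2$, which is decreasing for x < 0 and increasing for x > 0. The formula is applied on each piece (with the sign reversed on the decreasing piece) and the two contributions are added; the result is compared with a numerical derivative of $P(X^2 \le y)$.

```python
import math


def normal_density(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def density_of_square(y):
    """Density of X**2, assembled from the two monotone pieces of g(x) = x**2."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    # Increasing piece (x > 0): inverse is +sqrt(y), g'(x) = 2x > 0.
    increasing = normal_density(root) / (2.0 * root)
    # Decreasing piece (x < 0): inverse is -sqrt(y), g'(x) = 2x < 0,
    # so the sign of the right-hand side is reversed.
    decreasing = -normal_density(-root) / (2.0 * (-root))
    return increasing + decreasing


def cdf_of_square(y):
    """P(X**2 <= y) = P(-sqrt(y) <= X <= sqrt(y)) for y >= 0."""
    root = math.sqrt(y)
    return normal_cdf(root) - normal_cdf(-root)


if __name__ == "__main__":
    y, step = 1.5, 1e-5
    numeric = (cdf_of_square(y + step) - cdf_of_square(y - step)) / (2.0 * step)
    print(density_of_square(y), numeric)
```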

Sums of Independent Random Variables 

Suppose that X and Y are two random variables. Their
sum is again a random variable, X+Y. For example, in the
Uniform process, $X_{(2)} = L_1 + L_2$. Now if we know the distributions
of X and Y, can we compute the distribution of the sum X+Y?
In general, the answer is no, for we need the joint
distribution in order to compute the distribution of the sum. In
the above example, we cannot compute the distribution of $X_{(2)}$
from the distributions of $L_1$ and $L_2$ alone; we must know also
the joint distribution of $L_1$ and $L_2$.

On the other hand, if X and Y are independent, we can 
compute their joint distribution from their individual dis- 
tributions. As a result we expect that the distribution of 
X+Y has some reasonable expression in terms of the distributions 
of X and Y. Suppose for the moment that X and Y 
are independent, integer random variables with distributions 

$$P(X=n) = p_n$$
$$P(Y=n) = q_n$$

Then by the law of alternatives,

$$P(X+Y=n) = \sum_k P(X+Y=n \mid Y=k)\,P(Y=k)$$

$$= \sum_k P(X=n-k \mid Y=k)\,P(Y=k).$$
Since X and Y are independent, P(X=n-k|Y=k) = P(X=n-k).
Therefore,

$$P(X+Y=n) = \sum_k P(X=n-k)\,P(Y=k) = \sum_k p_{n-k}\,q_k.$$

The distribution $r_n = \sum_k p_{n-k}\,q_k$ is called the (discrete)
convolution of the distributions $p_n$ and $q_n$.

In the case of integer random variables, we can see
clearly what the convolution means: P(X+Y=n) is the sum
of the probabilities of all possible "ways" that X+Y can equal
n: X=k and Y=n-k, for all possible k. In the continuous case,
the sum is replaced by an integral, but the idea is the same.
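The discrete convolution is short enough to write out directly. The sketch below represents a distribution as a dictionary from integer values to probabilities; that representation, the function name, and the example of two fair dice are our own illustrative choices.

```python
from fractions import Fraction


def discrete_convolution(p, q):
    """Distribution of X+Y for independent integer X ~ p and Y ~ q.

    Each pair (X=x, Y=y) is one "way" for the sum to equal x + y,
    and contributes probability p[x] * q[y].
    """
    r = {}
    for x, px in p.items():
        for y, qy in q.items():
            r[x + y] = r.get(x + y, 0) + px * qy
    return r


# Illustrative assumption: X and Y are the faces shown by two fair dice.
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = discrete_convolution(die, die)

if __name__ == "__main__":
    for total in sorted(two_dice):
        print(total, two_dice[total])
```

Using exact fractions rather than floating point makes it easy to confirm that the resulting probabilities add up to exactly one.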

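Suppose that X and Y are continuous random variables
having densities

$$\mathrm{dens}(X=x) = f(x)$$

$$\mathrm{dens}(Y=x) = g(x).$$

If X and Y are independent, then by the continuous law of
alternatives,

$$\mathrm{dens}(X+Y=t) = \int_{-\infty}^{\infty} \mathrm{dens}(X+Y=t \mid X=s)\,\mathrm{dens}(X=s)\,ds$$

$$= \int_{-\infty}^{\infty} \mathrm{dens}(Y=t-s)\,f(s)\,ds$$

$$= \int_{-\infty}^{\infty} g(t-s)\,f(s)\,ds.$$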



The function

$$h(t) = \int_{-\infty}^{\infty} g(t-s)\,f(s)\,ds$$

is called the convolution of f and g, which we shall write
h = f*g.

We have just proved that: a sum of independent continuous 
random variables corresponds to convolution of the densities . 

The convolution of two functions is an important
operation which appears in numerous contexts, for example dynamical
systems in engineering and optics in physics, to name just two.
Its appearance in probability theory is perhaps the most easily
understood context in which the convolution arises. Actually
there are many operations similar to the one above that go by
the name "convolution." For example if we consider the special
case of two continuous, positive, independent random variables
X and Y, the density of their sum is

$$h(t) = \int_0^t g(t-s)\,f(s)\,ds,$$

because f(s) = 0 if s<0 and g(t-s) = 0 if s>t. This is the
form of the definition one sees most commonly.
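The integral from 0 to t is easy to approximate numerically. In the sketch below the densities are assumptions chosen because the answer is known in closed form: $f(s) = e^{-s}$ and $g(s) = 2e^{-2s}$ for s ≥ 0, for which the integral works out to $h(t) = 2(e^{-t} - e^{-2t})$. The midpoint rule and the function names are our own choices.

```python
import math


def density_of_positive_sum(f, g, t, steps=2000):
    """Midpoint-rule approximation of h(t) = integral_0^t g(t-s) f(s) ds."""
    if t <= 0.0:
        return 0.0
    width = t / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * width
        total += g(t - s) * f(s)
    return total * width


# Illustrative assumption: two positive random variables with these densities.
def f(s):
    return math.exp(-s) if s >= 0.0 else 0.0


def g(s):
    return 2.0 * math.exp(-2.0 * s) if s >= 0.0 else 0.0


def closed_form(t):
    """Exact value of the integral for the two densities above."""
    return 2.0 * (math.exp(-t) - math.exp(-2.0 * t)) if t > 0.0 else 0.0


if __name__ == "__main__":
    for t in (0.5, 1.0, 3.0):
        print(t, density_of_positive_sum(f, g, t), closed_form(t))
```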
Although it is not obvious from the definition,
convolution is a commutative, associative operation. That is,
for densities f,g and h: 



f*g = g*f 



(f*g)*h = f*(g*h) . 



These follow from the fact that addition of random variables 

is commutative and associative, respectively. 

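As an example of a convolution, we show a result which
is implicit in many of the calculations in chapter IV: the
sum of normally distributed random variables is again normal.
We will just take the case of two standard normal random
variables, and leave the general case as an exercise. By
definition, the convolution of the standard normal density
with itself is given by:

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\cdot\frac{1}{\sqrt{2\pi}}\exp(-(y-x)^2/2)\,dx$$

$$= \frac{1}{2\pi}\int_{-\infty}^{\infty} \exp\!\left(-\tfrac{1}{2}\left(x^2+(y-x)^2\right)\right)dx$$

$$= \frac{1}{2\pi}\exp(-y^2/2)\int_{-\infty}^{\infty} \exp\left[-x(x-y)\right]dx$$

Using the change of variables u=x-y/2, this becomes:

$$\frac{1}{2\pi}\exp(-y^2/2)\int_{-\infty}^{\infty} \exp\left[-(u+y/2)(u-y/2)\right]du$$

$$= \frac{1}{2\pi}\exp(-y^2/2)\int_{-\infty}^{\infty} \exp(-u^2+y^2/4)\,du$$

$$= \frac{1}{2\pi}\exp(-y^2/2+y^2/4)\int_{-\infty}^{\infty} \exp(-u^2)\,du$$

$$= \frac{1}{2\pi}\exp(-y^2/4)\,\sqrt{\pi}$$

$$= \frac{1}{\sqrt{2\pi}\,\sigma}\exp(-y^2/(2\sigma^2)), \quad\text{where } \sigma = \sqrt{2}.$$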
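The result of this calculation, that the standard normal density convolved with itself is the normal density with variance 2, can be confirmed numerically. The sketch below is only a check under our own assumptions: the integral over the whole line is truncated to [-12, 12], where the integrand is negligible, and approximated by the midpoint rule; the function names are ours.

```python
import math


def normal_density(x, variance=1.0):
    """Density of N(0, variance)."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def convolve_at(f, g, y, lower=-12.0, upper=12.0, steps=6000):
    """Midpoint-rule approximation of the integral of f(x) g(y - x) dx."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width
        total += f(x) * g(y - x)
    return total * width


if __name__ == "__main__":
    for y in (0.0, 1.0, 2.5):
        numeric = convolve_at(normal_density, normal_density, y)
        print(y, numeric, normal_density(y, variance=2.0))
```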



* 

9. Geometric Probability 
10. Exercises for Chapter V: Conditional Probability

Discrete Conditional Probability

1. A game is played with six double-sided cards. One card has
"1" on one side and "2" on the other. Two cards have "2" and
"3" on the two sides. And the last three have "3" and "4" on
them. The six cards are shuffled by one person. A random card
is then drawn and held in a random orientation between two other
persons, each of whom sees only one side of the card. The winner
is the one seeing the smaller number. Suppose that the first
person chooses the "2/3" card. Compute the probabilities each
of the two persons thinks he/she has for winning.

2. A person is given an urn and is told it contains 4 balls:
2 red and 2 black. He draws two of the balls at random without
replacing them, and both turn out to be red. He puts these aside.
What is the probability that the next ball drawn is black? Another
person in the room has been blindfolded during all of the preceding. 
After taking off her blindfold, she takes a ball out of the urn 
at random. She knows which balls were originally in the urn and 
that two have been drawn so far but does not know their color. 
What does she think the probability of drawing a black ball is? 
How could the fact that she was blindfolded have any effect on 
the probability of the next drawing of a ball? Explain. 

3. Place k balls into n boxes at random. If the first box 
is empty, what is the probability that the second is also? 
4. During a poker game a kibitzer manages to get a brief glimpse
of one of the hands (and no other hands) . In this glimpse he 
sees only that one card in the hand is an ace. He did not notice 
which ace it was. What is the probability that the hand has at 
least two aces? If the kibitzer noticed that one card is a black 
ace, what is the probability that the hand has at least two aces? 
Finally suppose the kibitzer saw that the hand had the Ace of 
Spades. Discuss whether such glimpses are really possible. The 
"moral" of this example is that (conditional) probabilities of 
events change considerably when one learns kinds of information 
that have no obvious relevance. 

5. Three prisoners are informed by their jailer that one of them
has been chosen at random to be executed and that the other two
are to be freed. They are told they will learn their fate in
one week's time. Prisoner A asks the jailer to tell him privately
the name of a fellow prisoner who will be set free, claiming that 
there would be no harm in divulging this information, since he 
already knows that at least one will go free, and he cannot inform 
the prisoner in question about his good fortune. The jailer 
refuses to tell prisoner A, pointing out that if A knew the name 
of one of his fellows to be set free, then his own probability of 
being executed would rise from 1/3 to 1/2, since he would then 
be one of two prisoners; and this would be cruel. What do you 
think of the jailer's reasoning? Be precise . 

6. You are playing bridge. Assume that the deck is thoroughly 

shuffled. If you receive 4 hearts, how many did your partner 

receive? Generalize to the case of receiving N hearts. 
7. (Neyman-Pearson errors) A commuter has the choice of taking
the train to work or of driving to work. If she takes the train 
she will get to work on time about one time in four. If she takes 
her car, she is almost certain of getting to work on time, but at 
considerable inconvenience. Although she calls the transit company 
every morning, their information is wrong a third of the time. 
So she adopts the following strategy: if the transit company says 
the train will be late, she always takes her car, and if not she 
takes her car a third of the time at random. Compute how probable 
it is that she will be late. What is the probability that she 
takes her car even though she would have been on time if she had 
taken the train? This kind of "mistake" is known as a Neyman-Pearson
Error of Type I. She makes an Error of Type II if the
train is late when she takes it. Compute the probability that 
she makes an error of type II. Note that the probability of 
either kind of error is a conditional probability. Furthermore 
in order to make the above computations, one must assume a number 
of independence properties of the various events. State explicitly 
any assumptions you must make in this problem. The probability 
of an error of type I is called the significance level of the 
decision, and 1 minus the probability of an error of type II is 
called the power of the decision. One should re-examine statistical 
hypothesis testing using this terminology. When several tests 
are available one clearly would like the one with the largest 
power for a given significance level. Unfortunately, however, in 
practice one rarely knows precisely what model will be implied 
by the rejection of the hypothesis being tested, so computing the 
power of a test is not as easy as it seems. 
8. Suppose that the commuter in exercise 7 scores the two
inconveniences of being late and of driving the car at 1 and 2 
respectively. What is her optimal strategy? Do the same with 
1 and 2 interchanged. 

9. An event A of positive probability is said to be favorable
to an event B if

$$P(B \mid A) \ge P(B),$$

in other words, if we know that A has occurred then the probability
that B has occurred also is the same as it was or is greater.
Note that if A is independent of B, then it is favorable to B.
Suppose we have a family having two children. Let A be the
event "the first child is a girl," let B be the event "the second
child is a girl," and let C be the event "the two children have
different gender." Show that A and B are both favorable to C
but that $A \cap B$ is not. Similarly show that C is favorable to
both A and B but not to $A \cap B$. Give examples to show that an
event can be favorable to two others without favoring their union
and conversely that two events can favor a third without their
union doing so.

10. (Simpson's paradox) A new treatment for a disease has just 

become available but is still experimental and is very expensive. 

In a teaching hospital with a large budget a random sample of 100 

patients with the disease are randomly broken into two groups, one 

having 90 patients the other 10. The larger group is treated. 30 

of these show clear improvement and only one of the members of the 

other group does. In a city hospital with a smaller budget a 
similar test is made but now the smaller group gets the treatment.
In this group 9 show improvement while in the untreated group half
improve and half do not. In either case the treatment seems to
be effective. However, if we view this as one sample of 200,
100 of which are treated and the other 100 are not, then a
different picture emerges. Of the treated patients 39 improve and
of the untreated patients some 46 improve. This seems to suggest
that the treatment actually decreases one's chance for improvement.
Explain the apparent paradox here.

Bayes' Law

11. There are three children in a family. A friend is told that 
at least two of them are boys. What is the probability that all 
three are boys? The friend is then told that the two are the 
oldest two children. Now what is the probability that all three 
are boys? Use Bayes' Law to explain this. Assume throughout
that boys are as likely as girls and that each child is
independently either a boy or a girl.

12. A student is about to take a quiz. If he studies, he will 
pass with probability .99, but if he goes to the dorm party his 
chances of passing decline to 1/2. The next day he passes the 
exam. Did he go to the party? 

13. Use Bayes' law to compute the probability in exercise 7
that the commuter took the train given that she was on time. 
14. Three boxes each contain two coins. One has two silver
coins, one has two gold coins, and one has one of each. A box
is chosen completely at random and a coin is chosen at random from
that box. It is gold. Is the other coin in the box gold also?

15. The manufacturer of screws in exercise IV. 1:5* is producing 
good screws 99% of the time but now the machine that detects the 
flawed screws is itself out of adjustment, producing an incorrect 
decision 10% of the time. What is the probability that a discarded 
screw is really flawed? 

16. A lie detector test is known to be 80% reliable when the
person is guilty and 95% reliable when the person is innocent.
If a suspect was chosen from a group of suspects of which only 1%
have ever committed a crime, and the test indicates that he is
guilty, what is the probability that he is innocent?

17. In the optimal choice problem (exercise TTL.&t ) the correct 
strategy is to make no decision for a certain length of time (say 
k days) and then to choose the best candidate of all those seen 
up to that point. If the monarch chooses the j-th candidate what
is the probability that she is the best candidate? 

18. In exercise 17 above compute the probability that the monarch 
will choose the j-th candidate using the above strategy. Use this
to find the probability that this strategy succeeds. For which 

k will this be maximized? Generalize to N candidates. 
19. Use Bayes' law to compute the probability of each of the
kinds of hands in exercise II. £1 given that one has at least one 
pair. 

Continuous Conditional Probability 

20. A target is a disk of radius 1 m. A bullet is fired at the
disk and hits it. Assume the bullet's mark has a uniform dis- 
tribution, i.e., the probability that it hits a region A is 
proportional to the area of A. How far from the center does the 
bullet hit? 

21. A scimitar is a sword shaped like a circular arc (at least
for this problem) . Suppose that during a Turkish festival n 
Turks throw their scimitars independently and at random along a 
circle of circumference a. Suppose that each scimitar has arc 
length h along this circle. What is the probability that none 
of the scimitars overlap? 



22. In the Uniform process of sampling n > 2 points from [0,a],
what is the probability that the first three gaps are all less than
b?

[Figure for exercise 21: a circle of circumference a with
4 nonoverlapping scimitars.]
23. For $1 \le i < j < k \le n$, compute the joint density of $X_{(i)}$
and $X_{(k)}$ given that $X_{(j)} = t$.



24. Let $L_1, L_2, \ldots, L_{n+1}$ be the gaps in the Uniform process
of sampling n points from [0,a]. Find the distributions of the
order statistics of the gaps, i.e., put the gaps in order, getting
the random variables $L_{(1)}, L_{(2)}, \ldots, L_{(n+1)}$. Then compute their

expectations. Compare this with the scimitar problem (exercise 
21) above and with the broken DNA problem (exercise III. 53) . 

25. Compute the distribution of the median gap in the Uniform 
process of sampling n points from [0,a]. 

26. In the Uniform process of sampling n points from [0,a], 
what is the probability that the largest gap is at least twice 
as large as the smallest gap? 

27. Given positive numbers $t_1, t_2, \ldots, t_{n+1}$, what is the
probability that in the uniform process, for all i=1,2,...,n+1,
the i-th gap is greater than $t_i$?

28* Give a rigorous statement and proof of the identity

$$\mathrm{dens}(X=t \mid Y=s) = \frac{\mathrm{dens}(X=t,\,Y=s)}{\mathrm{dens}(Y=s)}$$



29* Give a rigorous ($\varepsilon$-$\delta$) proof of the continuous law of
alternatives . 



30. Define a cluster of size k and width $\varepsilon$ to be a sequence
of k points contained in an interval of length $\varepsilon$. In the
Uniform process how many clusters of size k and width $\varepsilon$ are
there?



Exchangeability 

31. Drop r red points and b black points (r+b=n) at 
random uniformly on [0,a]. What is the probability that a 
run of h red points precedes a run of t black points? 

32. Compute the probability that at least one of the four 
players in a bridge game is dealt a yarborough. Note that the 
four hands are not independent but are exchangeable. Compare this 
answer with what you would get if you dealt the four hands inde- 
pendently from four different decks. Use the result of exercise 
II. iZ. 

* 

33. (Discrete Needles on a Stick Problem) . Choose k numbers 
from the set {1,2 . . . , n} at random. What is the probability 
that no two are closer than a units apart? Note that the
answer depends on whether one uses Fermi-Dirac, Bose-Einstein 
or Maxwell-Boltzmann statistics. 

34. (Discrete Scimitars on a Circle Problem). Choose k numbers 
from the set of integers modulo n. What is the probability that 

no two have a difference congruent modulo n to one of the integers 
in the set {-a + 1 , . . . , -2 ,-1 ,0 ,1 , . . . , a - 1} ? As in exercise 33 
above, the answer depends on which kind of statistics we use.

Change of Variables 

35. Find the distributions of the following random variables in 
terms of that of the random variable X : 

(a) Y = X + c, where c is a constant, 

(b) Y = aX + b, where a and b are constants, 

(c) Y = |X|

(d) Y = $\sqrt{X}$, where X is a positive random variable,

(e) Y = ln(X), where X is a positive random variable
and ln denotes the natural logarithm.

36. A point is dropped at random (uniformly) on a square of 
side a. What is the distance of this point from the center of 
the square? 

37. Let S be the speed of a molecule in a uniform gas at
equilibrium. Then S is a positive random variable whose
density is given by $\mathrm{dens}(S=s) = 4\sqrt{b^3/\pi}\; s^2 e^{-bs^2}$ for s>0,
where b is a constant which depends on the absolute temperature
and mass of the molecule. Find the density of the kinetic energy
E of the molecule, $E = \frac{1}{2} m S^2$.

38. Suppose that a long DNA molecule of length a is broken at 
random into two pieces. Compute the distribution of the ratio 
of the length of the longer piece by that of the shorter piece. 
Compute the ratio of the expected sizes of the longer and shorter 
pieces and the expected ratio of the longer and shorter pieces. 
[Answers: 3 and $\infty$]. Do the same for a molecule broken into
3 pieces.

39. (Student's t-distribution) The key fact behind much of
modern statistical theory is the Central Limit Theorem: the
standardization of a sum of independent, equidistributed random
variables is normally distributed in the limit as the sample size
gets large. Now we remarked in chapter IV that if we do not know
the variance of the random variables, then we can approximate
it using the sample variance. Unfortunately, if we only have a
small sample, the fact that the sample variance is being used
instead of the actual variance can result in the standardized
sum having a distribution considerably different from a normal
distribution even if the original sequence of random variables
were all normally distributed. This fact was first pointed out
by William Gosset, who wrote under the pseudonym of "Student."
We will now consider his computation.

Let $X_1, \ldots, X_n$ be a sequence of independent, normally
distributed R.V.'s with mean m and variance $\sigma^2$. The sample
mean is $\bar{m} = \frac{1}{n}(X_1 + \cdots + X_n)$ and the sample variance is
$\bar{\sigma}^2 = \frac{1}{n-1}\left((X_1 - \bar{m})^2 + \cdots + (X_n - \bar{m})^2\right)$. We wish to compute the
distribution of the random variable $t = \frac{\sqrt{n}}{\bar{\sigma}}(\bar{m} - m)$. Note that
this R.V. is not defined for n = 1. For the purposes of
computing the distribution of t we may assume that $X_1, \ldots, X_n$
have distribution N(0,1). Then for n = 2, t is the random
variable $\frac{X_1 + X_2}{|X_1 - X_2|}$. It is easy to check that $X_1 + X_2$ and
$X_1 - X_2$ both have distribution N(0,2) and are independent.
More generally show that t has the same distribution as the
ratio

$$\frac{X}{\sqrt{(Y_1^2 + \cdots + Y_{n-1}^2)/(n-1)}}$$

where $X, Y_1, \ldots, Y_{n-1}$ are independent and have the standard
normal distribution. The distribution of t is called the
Student's t-distribution with n-1 degrees of freedom.
When n = 2, the distribution of t is the same (up to a scale
change) as the gangster distribution in exercise III.xx. See 
exercise xx below. Compute the Student's t-distribution explicitly 
for the case of 1 degree of freedom. 

40. Let X and Y be independent, uniformly distributed
random variables on [0,1]. Prove that $\cos(2\pi X)\sqrt{-2\ln(Y)}$ has
distribution N(0,1). This fact is useful for generating a
sequence of independent, normally distributed pseudo-random
numbers by computer, since most computers have a pseudo-random
number generator that produces a sequence of independent,
uniformly distributed pseudo-random numbers from [0,1].

41. Let $X_1, \ldots, X_n$ be a sequence of independent, standard normal
random variables. Compute the distributions of the order statistics
$X_{(1)}, \ldots, X_{(n)}$ of these random variables. Write a computer
program that uses a numerical integration to find an approximation
for $E(X_{(j)})$ accurate to 3 decimal places. Then make a table
of $E(X_{(j)})$ for n between 1 and 20.

42* In a college cafeteria ice cream is available for the evening 
meal in servings that vary in weight according to a normal distri- 
bution with a standard deviation of 100 gm. The cafeteria workers 
maintain about 15 servings for students to choose from. Every 
day student A chooses the smallest serving available while student 
D chooses the largest. Over the school year (200 meals) , how much 
more ice cream does student D eat than student A? 
43* Drop n points at random independently and uniformly on a
square of side a. How close is the point closest to the
center of the square? 

44** Drop n + 1 points at random independently and uniformly
on a square of side a . What is the distance to the nearest 
neighbor of the first point? (This is the pennies-on-a-carpet 
problem mentioned in the introduction and is currently an un- 
solved problem. ) 

Convolutions of Random Variables 

45. (Rayleigh distribution) Let X and Y be independent
random variables having distribution $N(0,\sigma^2)$. Find the
distribution of $\sqrt{X^2 + Y^2}$. We can interpret this as the distribution
of the deviation of an object from a target point when the object
is dropped onto the target from above. X and Y are the
deviations in the x and y directions with respect to a rectangular
coordinate system whose origin is at the target point.

46. For the situation described above, consider a circle and a 
square of the same area, both centered at the target point. 
Which is more likely to contain the point where the object lands?
Hint: use probabilistic reasoning. 

47. ($\chi^2$-distribution)
Return to exercise IV.19. The FDA should be just as concerned
with variance as with the mean quantity of impurity. For example,
if a company produces pills with an average of 4 ppm impurities
but a variance of 4 (ppm)$^2$, 31% of the pills it is producing are
below standard. One can test a sample variance just as one can
test a sample mean. The distribution of the sample variance of
a random sample of size n from a normally distributed
population is called the chi-square distribution with n - 1 degrees
of freedom. If the mean is known and not simply computed from
the data of the random sample, then the distribution is the
chi-square distribution with n degrees of freedom. See exercise 6.

The chi-square distribution can be computed as follows. First
compute the distribution of $X^2$, when X is $N(0,\sigma^2)$. This
is the chi-square with one degree of freedom and mean $\sigma^2$. For
the general case let $X_1, \ldots, X_n$ be independent random variables
each distributed as $N(0,\sigma^2)$. The sum $\frac{1}{n}(X_1^2 + \cdots + X_n^2)$ has the
distribution of the chi-square with mean $\sigma^2$. Compute the
variance of the chi-square distribution. [Answer: $3\sigma^4$]. Now
for large samples, we can use the Central Limit Theorem to conclude
that the chi-square is approximately normally distributed. What
is its variance? [Answer: $3\sigma^4/n$] Now suppose that the FDA

determines that 9 ppm impurity is possibly hazardous. It would 
seem reasonable to require that no more than 1 pill per thousand 
can have this much impurity. Does the drug company examined in 
exercise IV. 19 conform to this requirement? Use 95% one-sided 
confidence intervals both for the mean and for the variance. 
Note that the number of degrees of freedom in our sample is 49 
not 50. [Answer: No] 
48. Return to exercise IV.39. Suppose that the resistors are
in parallel rather than series. Using a suitable normal approxi- 
mation find a 95% confidence interval for the resistance of this 
circuit. 

49. In exercise III.36, the gangster sprays the wall with a
machine gun, shooting n bullets independently, each in a random
direction toward the wall. What is the distribution of the sum
of the positions of the n bullet holes? Assume that the median
(III.37) is the zero point and that distances are measured in
meters. What is the distribution of the average position of the 
n bullet holes? Does this result explain what you observed 
in exercise IV. 58? 

50* Compute the density of the sum of n independent, uniformly 
distributed random variables on [0,a]. 

51. Give a rigorous proof of the convolution theorem. 

Geometric Probability

52. In a circus carnival game, a player tosses a quarter onto
the surface of a table ruled in a checkerboard pattern of two-inch
squares, which is further subdivided into one-inch squares
by lines of another color. A quarter is 15/16" in diameter. If
it falls entirely inside one of the two-inch squares, the player
receives 50 cents (his original quarter plus another). If it falls
inside one of the one-inch squares, he receives a prize worth
about twenty dollars. Otherwise the player receives nothing.
What is the probability of winning each prize, and what is the
average return on one's investment for one toss of a quarter?

53. Choose four points at random on a circle. Call them
$X_1$, $X_2$, $X_3$ and $X_4$. What is the probability that the chords
$X_1X_2$ and $X_3X_4$ intersect? Hint: use a symmetry argument.

54. A captain of a ship can determine its position by using radio 
bearings from transmitters on shore. These only give a direction 
so it is necessary to use at least two such bearings to determine 
the position. However, because of errors of measurement, one 
usually takes three bearings and the position lines are then plotted 

on a map as in the example below. The ship is assumed to lie inside the
triangle formed by the three lines. All we know about the errors of
measurement in the three bearings is that they are independent and
symmetric about the true bearing. What is the probability that the
ship actually lies inside the triangle? [Answer: 1/4].

55. Let $\vec{X}$ be a randomly oriented unit vector in 3-space. Show
that the length L of the projection of $\vec{X}$ on the x-axis (i.e., the
x-component of $\vec{X}$) is uniformly distributed on [0,1] and that
E(L) = 1/2.

56. (Feller) What is the length of a random segment intersecting
a unit sphere? More precisely, let P be a point on the sphere,
and let L be a line through P in a random direction. Let S be
the length of the intersection of L with the sphere. What is the
distribution of S? [Answer: uniformly distributed on [0,2]].

57. Let $\vec{X}$ be as in exercise 55, and let U be the length of
the projection of $\vec{X}$ on the (x,y)-plane. Show that U has prob-
ability density $f(t) = t/\sqrt{1 - t^2}$ for $0 < t < 1$ and that
$E(U) = \pi/4$.

58. Let L be the length of the x-coordinate of a randomly oriented
unit vector in 2-space. Show that L has probability density
$f(x) = 2/(\pi\sqrt{1 - x^2})$ for $0 < x < 1$ and that $E(L) = 2/\pi$.

59. (Feller) Why are two violins twice as loud as one? This may
sound facetious at first until one recalls that loudness is pro- 
portional to the square of the amplitude of the vibration. The 
incoming waves may be represented by random unit vectors, the length 
being the amplitude and the angle the phase. When two violins are 
played, the resulting vector is the vector sum of the two vectors, 
but since they come from different sources we may regard them as 
being independent random vectors. Show that the expected value of 
the square of the length of the sum of the two vectors is twice the 
expected value of the square of the length of one of them. 

60. An isosceles triangle is formed by a unit vector in the
x-direction (i.e., in 2-space either (1,0) or (-1,0)) and another
in a random direction. Find the distribution of the length of the
third side. Do this both in 2-space and in 3-space.

[Figure (2-space): an isosceles triangle formed by a unit vector of
length 1 along the x-axis and a random unit vector.]



61. What is the probability that a random quadratic polynomial,
$ax^2 + bx + c$, has real roots? Here the coefficients are independent
and uniformly distributed on [0,1].

62. A needle of length $\ell$ is dropped on a grid ruled in a checker-
board pattern with rulings spaced a units apart. What is the average
number of lines the needle crosses?

63. A planet contains five small islands which we may regard as 
five independent random points on a sphere. What is the probability 
that at least four lie in the same hemisphere? 

64. Let P and Q be two independent random points on a circle 
whose center is 0. What is the distribution of the angle POQ? Do 
the same for two points on a sphere. 

Fluctuation Theory 

65. (Epstein) A gambling house offers the following game. After 
paying an entrance fee E , a coin is tossed until the number of 
heads exceeds the number of tails. The player is then paid the 
number of dollars equal to the number of times the coin was tossed. 
What is the average amount of money that the player expects to receive? 
[Answer: infinite]



66. A random walk in two or more dimensions is simply two or more 
independent one-dimensional random walks acting simultaneously. 
What is the probability that a two-dimensional random walk, starting 
from the origin, eventually returns to the origin? If it returns, 
how long, on the average, does it take to do so? Do the same for a 
three-dimensional random walk. [Answers: probability 1, infinitely 
long time, probability about 0.239]. 

67* If the gambling house in exercise 65 above has only N dollars
available for winners, what is a fair entrance fee E for the game 
described there? 

Supplementary Exercises 

68. Rewrite the following vague conversation using the language of 
probability theory. You may assume that it is possible to distin- 
guish between "good weather" and "bad weather" unambiguously. 

"The Weather Bureau isn't always right, but I would say that 
they are right more often than not," said Alice thoughtfully. 

"Ah, but what comfort is it during miserable weather to know 
that the forecast was right? If it's wrong that isn't going to 
affect the likelihood of good weather," retorted Bob. 

"You may be right, but that doesn't contradict what I said, 
even though the forecast is pessimistic only about twice a week," 
answered Alice persuasively.

Chapter VI The Poisson Process

The Poisson process is the third basic stochastic process, 
the first two being the Bernoulli and Uniform processes. It 
can be defined in many ways. We will start with a more abstract 
approach in section 1. In this section we concentrate on some 
of the random variables occurring in this process. In the next 
two sections we give a more intuitive development of the 
Poisson process based on what we already know about the Uniform 
and Bernoulli processes. Once we have thoroughly established 
the properties of the Poisson process, we then turn things 
around by showing that the Uniform process can be obtained by 
conditioning the Poisson process! 

1. Continuous Waiting Times

Suppose we toss a coin k times and that we get k 
tails. It is intuitively obvious that on the next toss 
there is the same probability for heads as ever: the coin 
does not remember what took place in the past. We can ex- 
press the fact that a coin has no memory in terms of the 
single waiting time as follows 

\[ P(W_1 > k+n \mid W_1 > k) = P(W_1 > n). \]

The probability that one will get a run of k+n tails given 
that one has just gotten a run of k tails is simply the 
probability that one will get the additional n tails: the 
preceding tails neither help nor hurt. How long one must 
wait does not depend on how long one has already waited. 
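
The memoryless identity above can be checked exactly for the Bernoulli waiting time. The following is a minimal sketch, assuming a coin with probability p of heads so that $P(W_1 > n) = q^n$; the function names `tail` and `conditional_tail` and the sample values are ours, chosen only for illustration.

```python
from fractions import Fraction


def tail(p, n):
    """P(W1 > n): the first n tosses are all tails, for a coin with P(heads) = p."""
    return (1 - p) ** n


def conditional_tail(p, k, n):
    """P(W1 > k+n | W1 > k), computed from the definition of conditional probability."""
    return tail(p, k + n) / tail(p, k)


p = Fraction(1, 3)  # illustrative bias, chosen arbitrarily
for k, n in [(1, 1), (2, 5), (7, 3)]:
    # Exact rational arithmetic, so equality (not approximate equality) is expected.
    assert conditional_tail(p, k, n) == tail(p, n)
print("memoryless identity holds for the sampled (k, n)")
```

Exact fractions are used so that the comparison does not depend on rounding.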

In real life if one is waiting for an incident to
take place there is no abstract entity flipping an abstract
coin during small discrete time intervals determining when
the incident is to occur. For example one might be standing
next to a Geiger counter waiting for a click. The waiting
time in this case is continuous, but like the Bernoulli
waiting time the waiting time has no memory. We express
this using conditional probability.

Definition. A positive continuous random variable W is
said to have the exponential distribution when

\[ P(W > t+s \mid W > s) = P(W > t) \]

for all positive t,s.

We will also call W a continuous memoryless waiting time, although we
will see that the value of W need not represent time. The
exponential distribution is a ubiquitous distribution
appearing in the most unexpected places.

What is surprising about random variables having the
exponential distribution is that the seemingly innocuous
assumption we have made in defining this concept determines the
probability distribution of W but for a single parameter. To
see this let G(t) = P(W>t). The condition P(W>t+s | W>s) = P(W>t)
may also be written $P((W>t+s) \cap (W>s)) = P(W>t)P(W>s)$. But the
event (W>t+s) is a subevent of the event (W>s). Therefore
$(W>t+s) \cap (W>s) = (W>t+s)$. So we may equally well characterize
a continuous memoryless waiting time by the condition:

\[ P(W>t+s) = P(W>t)P(W>s), \]

or in terms of G:

\[ G(t+s) = G(t)G(s). \]



From this equation alone we can, using calculus, deduce
that $G(t) = Ke^{Ct}$ for suitable constants K and C. Those who
have seen this done before can skip the next paragraph.

If we think of G(t+s) as a function of two variables
t and s, we may compute the partial derivatives by the
chain rule of calculus:

\[ \frac{\partial}{\partial t} G(t+s) = G'(t+s)\,\frac{\partial (t+s)}{\partial t} = G'(t+s). \]

Similarly, $\frac{\partial}{\partial s} G(t+s) = G'(t+s)$. Next we differentiate
G(t)G(s) with respect to both t and s:

\[ \frac{\partial}{\partial t}\bigl(G(t)G(s)\bigr) = G'(t)G(s), \qquad
   \frac{\partial}{\partial s}\bigl(G(t)G(s)\bigr) = G(t)G'(s). \]

Therefore

\[ G'(t)G(s) = G'(t+s) = G(t)G'(s). \]

Divide both sides by G(t)G(s) to get

\[ \frac{G'(t)}{G(t)} = \frac{G'(s)}{G(s)}. \]

Now this must hold no matter what t and s are. Therefore

\[ \frac{G'(t)}{G(t)} = C \]

for some constant C. Finally if we integrate both sides
we get

\[ \ln|G(t)| = Ct + D \]

for some constant D, and this is the same as

\[ G(t) = Ke^{Ct} \]

for some constant K.

The distribution of W is therefore

\[ F(t) = P(W \le t) = 1 - G(t) = 1 - Ke^{Ct} \quad \text{for } t > 0. \]

Since we must have $\lim_{t\to\infty} F(t) = 1$, the constant C must be
negative. Write $\alpha = -C$. Since we must also have
$\lim_{t\to 0,\ t>0} F(t) = 0$, the constant K must be 1. We
conclude that the probability distribution of a continuous
waiting time W is

\[ F(t) = P(W \le t) = 1 - e^{-\alpha t}, \]

for some positive constant $\alpha$; and its density is

\[ f(t) = \alpha e^{-\alpha t}. \]

We now see why we say that W is exponentially distributed.



[Figure: The density and distribution of a continuous waiting time.]

The parameter $\alpha$ may be interpreted as the frequency of the
incidents in time: roughly speaking, $\alpha$ incidents occur per
unit time "on the average."

The power of probabilistic reasoning (made rigorous 
by conditional probability) is that we may compute the 
distribution of a random variable without referring to a 
sample space or to events of it. The distribution is defined 
purely in phenomenological terms, i.e., in terms of the 
observed phenomena only. 
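
As a numerical illustration of the exponential waiting time, the sketch below checks the functional equation G(t+s) = G(t)G(s) for $G(t) = e^{-\alpha t}$ and estimates the conditional probability P(W > t+s | W > s) from simulated samples. It is only a sketch under stated assumptions: the names `survival` and `empirical_conditional_tail`, the sample size, the seed and the parameter values are ours, and the samples come from Python's standard `random.expovariate`.

```python
import math
import random


def survival(alpha, t):
    """G(t) = P(W > t) = exp(-alpha * t) for an exponential waiting time."""
    return math.exp(-alpha * t)


def empirical_conditional_tail(alpha, s, t, trials=200_000, seed=0):
    """Estimate P(W > t+s | W > s) from simulated exponential waiting times."""
    rng = random.Random(seed)
    samples = [rng.expovariate(alpha) for _ in range(trials)]
    survivors = [w for w in samples if w > s]       # condition on W > s
    return sum(w > t + s for w in survivors) / len(survivors)


alpha, s, t = 2.0, 0.4, 0.3  # illustrative values
estimate = empirical_conditional_tail(alpha, s, t)
# The two printed numbers should be close: having already waited s does not matter.
print(estimate, survival(alpha, t))
```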

Consider for example that we have a collection of
points dropped independently and uniformly throughout the
entire infinite plane. By "uniformly" we mean that the
probability of finding a point in a region of finite area t
depends only on the area t (not on its shape or location).
By "independently" we mean that for two disjoint regions
the probability of finding a point in one region is inde-
pendent of finding a point in the other. Write P(T>t) for
the probability of finding no point in a region of area t.
Then

\[ P(T>t+s) = P(T>t)P(T>s). \]

Therefore, as above,

\[ P(T>t) = e^{-\alpha t}, \]

where $\alpha$ may be interpreted as the density of the points
dropped on the plane. It is reasonable to regard T as a
"waiting area", i.e., "how large an area must a region be in
order to find a point in the region?"

Consider next a collection of stars distributed at 
random in a large region of space. How far away is the 
nearest neighbor to a star in this region? This is quite 
similar to the above problem, but we now have three di- 
mensions. Instead of a region of some area t, we use a 
spherical volume of radius r whose center is the given star. 
If the average density of the stars is $\alpha$, then

\[ P(\text{Nearest neighbor is more than } r \text{ units away}) = e^{-\frac{4}{3}\pi\alpha r^3}. \]

Suppose we are in a forest with randomly located trees.
How far can one see if one looks in one particular direction?
By symmetry one may assume that one looks along the positive
x-axis from the origin. Assume also, for simplicity, that
the trees are all p units in radius. Let T be the random
variable "how far can one see along the x-axis?" If T is
larger than t, then there are no centers of trees in the
region indicated below:

[Figure: the dotted region along the x-axis from the origin to t,
with an indentation on its left side.]

The indentation on the left side is a consequence of the fact
that one happens to be standing at that point. The area of
the dotted region is the same as that of a rectangle of sides
t and 2p. Therefore

\[ P(T>t) = e^{-2\alpha p t}, \]

where $\alpha$ is the density of the trees.
Needless to say this is an idealized model (trees do not all
have the same radius), but it illustrates the basic idea.

One gets a very similar model when one studies the ef-
fect of a beam of high energy protons entering a detector
consisting of a tank of liquid hydrogen. Here the "trees"
are the nuclei of the hydrogen atoms, although we should
build the model in three dimensions instead of two.

To summarize, every "waiting time" for which the
future does not depend on the past exhibits exponential decay.

The Gamma Distribution

We just saw that the analog for continuous random variables
of the random variable $W_1$ in the Bernoulli process is an
exponential random variable. We now ask for the analog of the
kth waiting time $W_k$ in the Bernoulli process. Now the kth
waiting time is the sum of the first k gaps in the Bernoulli
process: $W_k = T_1 + T_2 + \cdots + T_k$; and the gaps are independent.
Therefore we could have computed the distribution of $W_k$ by
convolving the distributions of the gaps, all of which are
geometric random variables. As an example, the
distribution of $W_2$ is then the convolution of the distribution
of $T_i$, $q^{n-1}p$, with itself, i.e.

\[ P(W_2 = n) = \sum_{k=1}^{n-1} q^{(n-k)-1}p\,q^{k-1}p = \sum_{k=1}^{n-1} q^{n-2}p^2 = (n-1)\,q^{n-2}p^2. \]

Consider now the continuous analogue of a waiting time:
the exponential distribution. The sum of two independent
exponentially equidistributed random variables $T_1$ and $T_2$ may
be regarded as the waiting time for the second occurrence,
$W_2 = T_1 + T_2$, just as in the Bernoulli process. In the next
chapter we shall build a more concrete model on which to
define this random variable. Although we haven't yet defined
the $T_i$'s on a specific sample space, we can nevertheless
compute the distribution of the continuous waiting time $W_2$.
It is the convolution of $\alpha e^{-\alpha t}$ with itself:

\[ \begin{aligned}
\mathrm{dens}(W_2 = t) &= \int_0^t \alpha e^{-\alpha(t-s)}\,\alpha e^{-\alpha s}\,ds \\
&= \alpha^2 \int_0^t e^{-\alpha t}\,ds \\
&= \alpha^2 e^{-\alpha t} \int_0^t ds \\
&= \alpha^2 t e^{-\alpha t}.
\end{aligned} \]



More generally, the density of the kth waiting time
is the convolution

\[ \underbrace{\alpha e^{-\alpha t} * \alpha e^{-\alpha t} * \cdots * \alpha e^{-\alpha t}}_{k \text{ times}}, \]

since $W_k$ is the sum $T_1 + T_2 + \cdots + T_k$ of k independent exponentially
distributed random variables, all with parameter $\alpha$. This
convolution is easily computed:

\[ \mathrm{dens}(W_k = t) = \frac{\alpha^k t^{k-1}}{(k-1)!}\,e^{-\alpha t}. \]

We call this the Gamma Distribution. Notice that it has two
parameters: $\alpha$ and k.
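
The convolution formula for the density of the kth waiting time can be checked numerically. The sketch below approximates the convolution integral by a midpoint rule and compares it with the closed form $\alpha^k t^{k-1} e^{-\alpha t}/(k-1)!$. The helper names (`exp_density`, `gamma_density`, `convolve_with_exp`), the number of steps and the parameter values are illustrative assumptions, not part of the text.

```python
import math


def exp_density(alpha, t):
    """Density alpha * e^(-alpha t) of an exponential waiting time (0 for t < 0)."""
    return alpha * math.exp(-alpha * t) if t >= 0 else 0.0


def gamma_density(alpha, k, t):
    """Closed form alpha^k t^(k-1) e^(-alpha t) / (k-1)! for t >= 0."""
    return alpha**k * t ** (k - 1) * math.exp(-alpha * t) / math.factorial(k - 1)


def convolve_with_exp(density, alpha, t, steps=2000):
    """Midpoint-rule approximation of int_0^t density(t - s) * alpha e^(-alpha s) ds."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h                      # midpoint of the i-th subinterval
        total += density(t - s) * exp_density(alpha, s)
    return total * h


alpha, t = 1.5, 2.0  # illustrative values
w2 = convolve_with_exp(lambda u: exp_density(alpha, u), alpha, t)
w3 = convolve_with_exp(lambda u: gamma_density(alpha, 2, u), alpha, t)
print(w2, gamma_density(alpha, 2, t))   # density of W_2 at t, two ways
print(w3, gamma_density(alpha, 3, t))   # density of W_3 at t, two ways
```

Convolving the density of $W_2$ once more with the exponential density gives the density of $W_3$, which is how the general formula can be built up one gap at a time.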

We end this section by computing the means and variances
of the continuous waiting times. It is an easy exercise to
verify that if T is exponentially distributed, then $E(T) = 1/\alpha$,
where $\alpha$ is the parameter defining the distribution of T. This
coincides with our intuitive feeling that $\alpha$ is an "intensity."

The variance of T is easily computed by using integration
by parts twice.

\[ \begin{aligned}
\mathrm{Var}(T) &= E(T^2) - E(T)^2 \\
&= \int_0^\infty t^2\,\alpha e^{-\alpha t}\,dt - \left(\frac{1}{\alpha}\right)^2 \\
&= \alpha\left[-\frac{t^2}{\alpha}e^{-\alpha t} - \frac{2t}{\alpha^2}e^{-\alpha t} - \frac{2}{\alpha^3}e^{-\alpha t}\right]_0^\infty - \frac{1}{\alpha^2} \\
&= \frac{2\alpha}{\alpha^3} - \frac{1}{\alpha^2} \\
&= \frac{1}{\alpha^2}.
\end{aligned} \]

So the standard deviation $\sigma(T)$ is the same as the mean
E(T): both are $1/\alpha$.

Because the kth waiting time $W_k$ of the Poisson process
is the sum of k independent, equidistributed exponential
random variables, the variance of $W_k$ is $k/\alpha^2$.
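
The mean and variance of the kth waiting time can also be estimated by simulation. The following sketch adds k independent exponential gaps and compares the sample variance with $k/\alpha^2$; it also compares the sample mean with $k/\alpha$, which follows from $E(T) = 1/\alpha$ by linearity of expectation. The function names, the seed, the number of trials and the parameter values are assumptions made only for illustration.

```python
import random
import statistics


def simulate_waiting_time(alpha, k, rng):
    """k-th waiting time as a sum of k independent exponential gaps."""
    return sum(rng.expovariate(alpha) for _ in range(k))


def sample_mean_and_variance(alpha, k, trials=100_000, seed=1):
    """Sample mean and (population-style) sample variance of simulated W_k."""
    rng = random.Random(seed)
    samples = [simulate_waiting_time(alpha, k, rng) for _ in range(trials)]
    return statistics.fmean(samples), statistics.pvariance(samples)


alpha, k = 2.0, 3  # illustrative values
mean, var = sample_mean_and_variance(alpha, k)
print(mean, k / alpha)       # sample mean vs. k/alpha
print(var, k / alpha**2)     # sample variance vs. k/alpha^2
```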



2. Comparing the Bernoulli and Uniform Processes 

We could at this point simply define the Poisson process
to be a sequence of independent, exponentially distributed
random variables, having the same parameter $\alpha$. But we
prefer to take a different approach, which builds on the
two processes we have already thoroughly studied. We there-
fore now give a detailed comparison of the Bernoulli and
Uniform processes emphasizing their similarities and their
differences. The Poisson process will be a "limit" of both
processes so that, in a sense, it furnishes a formal link
between them. In so doing we will discover some new aspects
of both these processes.

Parameters 

The Bernoulli process depends on a single parameter:
the bias p of the coin. The uniform process depends on two
parameters: the length a of the interval and the number n
of points sampled. There is already a certain asymmetry
here. The number of points per unit interval, $\alpha = n/a$, is
called the intensity of the uniform process. Different
uniform processes having the same intensity are quite similar,
especially when n is large.

Sample Spaces 

The sample space $\Omega$ of the Bernoulli process is the set
of all sequences of zeroes and ones. To every such sequence
we can associate a set of natural numbers: the set of
positions having ones. For example,

(0,1,1,0,1,1,0,...) corresponds to {2,3,5,6,...}.

This gives us a new way of looking at $\Omega$. It is the set of
all subsets of the natural numbers.

The sample space $\Omega$ of the uniform process is the set of
all sequences $(x_1, x_2, \ldots, x_n)$ of real numbers such that
$0 \le x_i \le a$. There seems to be little similarity between this
sample space and the Bernoulli sample space.

Events

The elementary events of the Bernoulli process are the
subsets $H_n = (X_n = 1)$ = "the nth toss is heads." The ele-
mentary events of the uniform process are the subsets
$(s \le X_i < t)$ = "the ith point falls in [s,t)". In both cases
the events in general are obtained by intersections, comple-
ments and unions from the elementary events.

Random Variables 

Up to now we have viewed the random variables $X_n$ of
the Bernoulli process and $X_i$ of the uniform process as being
fundamental. But there is an alternative point of view. We
could equally well define the Bernoulli process by the random
variables $S_n$, the number of successes in the first n tosses.
We should write this $S_n^{(p)}$ to denote the fact that it de-
pends on the parameter p. We know that $S_n^{(p)}$ has binomial
distribution with bias p:

\[ P(S_n^{(p)} = k) = \binom{n}{k} p^k q^{n-k}. \]

Similarly we could define the uniform process by the
random variables $U_{n,a}(t)$, the number of points falling in
the interval [0,t). These are random variables we have not
yet seen. For each t in [0,a], $U_{n,a}(t)$ is an integer random
variable. In fact, one can easily see that $U_{n,a}(t)$ has the bi-
nomial distribution, for points fall in [0,t) or in [t,a] with
the same probability as a tossed coin with bias $p = t/a$ lands
heads or tails, respectively. Therefore,

\[ P(U_{n,a}(t) = k) = \binom{n}{k} \left(\frac{t}{a}\right)^k \left(1 - \frac{t}{a}\right)^{n-k}. \]

We shall abbreviate $U_{n,a}(t)$ to simply U(t). For each
t, U(t) is a new random variable. When a collection of
random variables depends on a continuous parameter, we call
the collection a random function. Be careful not to think
of this as a "randomly chosen function" any more than a
random variable is a "randomly chosen variable."

Next we compare the waiting time $W_k$ for the kth success
in the Bernoulli process with the kth order statistic $X_{(k)}$.
If we think of [0,a] as part of a time axis, there is clearly
an analogy between these two random variables. Compare
their distribution and density:

\[ P(W_k = n) = \binom{n-1}{k-1} q^{n-k} p^k \]

\[ \mathrm{dens}(X_{(k)} = t) = \binom{n-1}{k-1} \frac{n}{a} \left(\frac{t}{a}\right)^{k-1} \left(1 - \frac{t}{a}\right)^{n-k} \]

These are quite similar indeed except that in the latter
case a factor of $t/a$ has become $n/a$, as a result of differentiation.

We finally come to the gaps in these two processes.
In the Bernoulli process, the gaps $T_i$ are equidistributed
with geometric distribution:

\[ P(T_i = k) = q^{k-1}p. \]

The gaps of the uniform process are also equidistributed,
having the Dirichlet distribution:

\[ \mathrm{dens}(L_i = t) = \frac{n}{a}\left(1 - \frac{t}{a}\right)^{n-1}. \]



The analogy between these two cases is quite striking.
However, the analogy breaks down because the gaps $T_i$
are independent, whereas the gaps $L_i$ are not. To be sure,
the $L_i$ try as hard as they can to be independent (they are
exchangeable), but this is not enough. Another way to see
the difference between the two processes is to return to the
"fundamental" random variables $S_n$ and U(t). The difference
$S_n - S_m$ is the number of successes between m and n. Similarly
U(t)-U(s) is the number of points falling between s and t.
Now if $(m_1,n_1]$ and $(m_2,n_2]$ are disjoint intervals
of integers, then the random variables $S_{n_1} - S_{m_1}$ and
$S_{n_2} - S_{m_2}$ are independent. But even if $[s_1,t_1)$ and
$[s_2,t_2)$ are disjoint subintervals of [0,a], $U(t_1)-U(s_1)$ and
$U(t_2)-U(s_2)$ are not independent.

[Figure: two disjoint subintervals of [0,a], labeled
U(t_1)-U(s_1) and U(t_2)-U(s_2).]
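
The dependence between increments of the uniform process shows up clearly in a simulation. The sketch below drops n points uniformly on [0,a], counts how many land in each of two disjoint subintervals, and estimates the covariance of the two counts. For comparison it prints $-n p_1 p_2$, the covariance of two cells of a multinomial distribution with cell probabilities $p_1$ and $p_2$ (a standard fact, not derived in the text). All names, the seed and the numerical values are illustrative assumptions.

```python
import random


def increments(n, a, s1, t1, s2, t2, rng):
    """Counts of n uniform points on [0, a] landing in [s1, t1) and in [s2, t2)."""
    points = [rng.uniform(0, a) for _ in range(n)]
    c1 = sum(s1 <= x < t1 for x in points)
    c2 = sum(s2 <= x < t2 for x in points)
    return c1, c2


def sample_covariance(n, a, s1, t1, s2, t2, trials=50_000, seed=2):
    """Sample covariance of the two increments over repeated experiments."""
    rng = random.Random(seed)
    pairs = [increments(n, a, s1, t1, s2, t2, rng) for _ in range(trials)]
    m1 = sum(c1 for c1, _ in pairs) / trials
    m2 = sum(c2 for _, c2 in pairs) / trials
    return sum((c1 - m1) * (c2 - m2) for c1, c2 in pairs) / trials


n, a = 10, 1.0  # illustrative values
cov = sample_covariance(n, a, 0.0, 0.3, 0.5, 0.9)
print(cov, -n * 0.3 * 0.4)   # estimate vs. multinomial covariance -n * p1 * p2
```

The covariance is negative because the total number of points is fixed: a point that lands in one subinterval cannot land in the other.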

The difficulty stems from the fact that the uniform
process uses a finite interval and a finite number of points
whereas the Bernoulli process is not limited in this way.
[Whether we choose to limit the Bernoulli process to a given 
finite number of tosses is irrelevant, for we can always 
continue it if we wish. The uniform process has no such 
option. We can always drop more points, but we cannot ex- 
tend the interval to a longer one without totally altering 
our process.] 

These considerations suggest that there is a third 
process that makes the analogy perfect. Just letting the length go 
to infinity doesn't work because we cannot make sense of 
sampling a single point or a finite number of points uni- 
formly from an infinite interval. We would like to have a 
process that is both uniform on an infinite interval and 
samples an infinite number of points. This works provided
we let a and n go to infinity simultaneously while keeping
the intensity $\alpha = n/a$ fixed. Intuitively, because the number of
points per unit interval remains the same, in the limit one
will have the same intensity, and the uniform processes will
converge to a new process.

For example, consider the density of a gap as a and
n become large:

\[ \mathrm{dens}(L_i = t) = \frac{n}{a}\left(1 - \frac{t}{a}\right)^{n-1}
   = \frac{\alpha\left(1 - \frac{\alpha t}{n}\right)^{n}}{1 - \frac{\alpha t}{n}}. \]

We know from calculus that $\lim_{n\to\infty} \left(1 + \frac{x}{n}\right)^n = e^x$. Therefore if
we let n tend to $\infty$, the above expression becomes

\[ \frac{\alpha e^{-\alpha t}}{1 - 0} = \alpha e^{-\alpha t}. \]

That is, in the limit the gaps become exponentially distri-
buted. This is exactly what we would expect, for the gaps of
the Bernoulli process are waiting times, and we would hope
that the gaps of the new process will be continuous waiting
times.
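
The convergence of the gap density can be observed numerically. The sketch below evaluates $(n/a)(1 - t/a)^{n-1}$ with $a = n/\alpha$ for increasing n and compares it with $\alpha e^{-\alpha t}$. The function names and the chosen values of $\alpha$ and t are ours, for illustration only.

```python
import math


def uniform_gap_density(n, alpha, t):
    """(n/a) * (1 - t/a)^(n-1) with a = n/alpha, valid for 0 <= t <= a."""
    a = n / alpha
    return (n / a) * (1 - t / a) ** (n - 1)


def exponential_density(alpha, t):
    """Limiting density alpha * e^(-alpha t)."""
    return alpha * math.exp(-alpha * t)


alpha, t = 2.0, 0.7  # illustrative values
for n in (10, 100, 1000, 10000):
    # The first number should approach the second as n grows with n/a = alpha fixed.
    print(n, uniform_gap_density(n, alpha, t), exponential_density(alpha, t))
```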

Consider another example: the joint density of two
gaps. In the uniform process this density is not the
product of the individual densities. In the limit, however,
the joint density of two gaps is the product of the densities:

\[ \begin{aligned}
\mathrm{dens}\bigl((L_1 = t_1) \cap (L_2 = t_2)\bigr)
&= \mathrm{dens}\bigl((X_{(1)} = t_1) \cap (X_{(2)} = t_1 + t_2)\bigr) \\
&= \frac{n(n-1)}{a^2}\left(1 - \frac{t_1 + t_2}{a}\right)^{n-2} \\
&= \alpha^2\left(1 - \frac{1}{n}\right)
   \frac{\left(1 - \frac{\alpha(t_1 + t_2)}{n}\right)^{n}}{\left(1 - \frac{\alpha(t_1 + t_2)}{n}\right)^{2}} \\
&\to \frac{\alpha^2 e^{-\alpha(t_1 + t_2)}}{(1 - 0)^2}
   = \left(\alpha e^{-\alpha t_1}\right)\left(\alpha e^{-\alpha t_2}\right).
\end{aligned} \]

So in the new process the gaps are in-
dependent and equidistributed, just as in the Bernoulli
process. This helps to confirm our feeling that this is
the correct approach.

As a final example, we consider the limit of the
random function $U_{n,a}(t)$ as $a, n \to \infty$. Recall that

\[ \begin{aligned}
P(U_{n,a}(t) = k) &= \binom{n}{k}\left(\frac{t}{a}\right)^k\left(1 - \frac{t}{a}\right)^{n-k} \\
&= \frac{(n)_k}{k!}\left(\frac{\alpha t}{n}\right)^k\left(1 - \frac{\alpha t}{n}\right)^{n-k}.
\end{aligned} \]

As in the last two examples, we can see that
$\left(1 - \frac{\alpha t}{n}\right)^{n-k} \to e^{-\alpha t}$
because k is a fixed integer. The limit of the first two
factors is a bit harder. First write them as:
$\frac{(n)_k}{k!} \cdot \frac{(\alpha t)^k}{n^k}$.
Then interchange denominators to get:
$\frac{(n)_k}{n^k} \cdot \frac{(\alpha t)^k}{k!}$. The
second factor is now independent of n and a. The first factor
is:

\[ \frac{n(n-1)\cdots(n-k+1)}{n \cdot n \cdots n}
   = 1 \cdot \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right). \]

Each of these factors approaches 1 as $n \to \infty$. Since there are
a fixed number of them, their product approaches 1 as $n \to \infty$.
Therefore,

\[ \lim_{\substack{n,a \to \infty \\ n/a = \alpha}} P(U_{n,a}(t) = k) = \frac{(\alpha t)^k}{k!}\,e^{-\alpha t}. \]
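
The limit just computed can be watched numerically. The following sketch compares the binomial probabilities $P(U_{n,a}(t) = k)$, with $a = n/\alpha$, against the limiting values $(\alpha t)^k e^{-\alpha t}/k!$ and prints the largest discrepancy over small k as n grows. The function names, the cutoff `kmax` and the parameter values are illustrative assumptions.

```python
import math


def binomial_pmf(n, p, k):
    """Probability of k successes in n tosses of a coin with bias p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(lam, k):
    """lam^k e^(-lam) / k!"""
    return lam**k / math.factorial(k) * math.exp(-lam)


def max_gap(n, alpha, t, kmax=20):
    """Largest |P(U_{n,a}(t) = k) - (alpha t)^k e^(-alpha t)/k!| for k = 0..kmax."""
    p = alpha * t / n          # equals t/a when a = n/alpha
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(alpha * t, k))
               for k in range(min(kmax, n) + 1))


for n in (10, 100, 1000, 10000):
    # The discrepancy should shrink as n grows with the intensity held fixed.
    print(n, max_gap(n, alpha=2.0, t=1.5))
```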



Since $U_{n,a}(t)$ is the fundamental random function of
the uniform process, its limit will be the fundamental
random function of the new process. We shall write $N_\alpha(t)$ or
N(t) for this limit:

\[ P(N_\alpha(t) = k) = \frac{(\alpha t)^k}{k!}\,e^{-\alpha t}. \]

Notice that the distribution of $N_\alpha(t)$ depends only on the
product $\alpha t$. We call this distribution the Poisson distri-
bution. More precisely:

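Definition. An integer random variable X is said to have
the Poisson distribution with parameter $\lambda$ if

\[ P(X = k) = \begin{cases} \dfrac{\lambda^k}{k!}\,e^{-\lambda} & \text{if } k \ge 0 \\[1ex] 0 & \text{if } k < 0. \end{cases} \]

The expectation of such a random variable is

\[ \begin{aligned}
E(X) &= \sum_k k\,P(X = k) \\
&= \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} \\
&= \lambda e^{-\lambda} e^{\lambda} \\
&= \lambda.
\end{aligned} \]

Therefore, $E(N(t)) = \alpha t$.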



If we imagine that an infinite number of points are
spread on the interval $[0,\infty)$ with density $\alpha$, then N(t) is the
number of points that have fallen in the interval [0,t).
The average number of points that fall in [0,t) is
$E(N(t)) = \alpha t$, and the average number of points per unit
interval is $E(N(t))/t = \alpha$. This justifies calling $\alpha$ the density
or intensity of the process.

Notice that we may no longer speak of which point has
fallen first, which is second and so on. If we return to the
uniform process for a moment, we can see why this would be so.
Originally we used $X_1, X_2, \ldots, X_n$ as the defining random variables
of the uniform process. If we use the random function $U_{n,a}(t)$,
we can no longer distinguish which point is which. All we know
is which point is $X_{(1)}$, which is $X_{(2)}$, and so on. For ex-
ample $X_{(1)} = t$ if $U_{n,a}(t) = 0$ and $U_{n,a}(s) \ge 1$ for s>t. We
can recover the entire uniform process if we know all the random
functions $U_{1,a}(t), U_{2,a}(t), \ldots, U_{n,a}(t)$. But when we let
$a, n \to \infty$ we only use the random function $U_{n,a}(t)$. As a result
the order statistics make sense in the Poisson process, but
there is no analogue of the random variables $X_i$ of the uni-
form process.

3. The Poisson Sample Space

So far we have discussed the Poisson process from two
points of view. We first considered it purely phenomeno-
logically via the random variables $T_i$ and $W_k$. Next we con-
sidered it as the limit of uniform processes as the length
of the interval increases. We must reconcile these two ap-
proaches, and we do so by an explicit construction of a model.

The Poisson sample space is $\Omega$ = {all rare sequences}, where
a rare sequence is a set of points $A \subseteq [0,\infty)$ such that
every finite interval has at most finitely many points of the
rare sequence, i.e. the sequence doesn't cluster. To avoid
confusion the points of a rare sequence are called incidents or blips.
Don't confuse this notion with the concept of a sample point.
The sample points of $\Omega$ in this case are the rare sequences
(not the blips).

Defining $\Omega$ this way seems very natural because the
Poisson process is the limit of the uniform process (of in-
tensity $\alpha$) on [0,a] as $a \to \infty$. Unfortunately we have allowed
n to approach infinity as well. So the intensity is not an
intrinsic part of the Poisson sample space as it was for the
uniform sample space, where the intensity is n/a. The same sit-
uation occurs for the Bernoulli process. There the sample
space is the same whatever the bias of our coin. It is only
through the definition of the probability P that we can say
that "the average number of heads in n tosses is np." In a
similar way, we shall define a probability P on the Poisson
sample space so that the average number of points falling on
any interval of length t is $\alpha t$.

We define the probability P on $\Omega$ by means of the random
function N(t). We already saw in the last section what the
distributions of the random variables N(t) ought to be. We
will see in a moment how this point of view leads immediately
to probabilities on the elementary Poisson events. All the
distributions of the other random variables on $\Omega$ will be de-
rived from the distributions of the N(t). We make three
fundamental assumptions:

(1) For every nonnegative real number t, N(t) is a non-
negative integer random variable whose value is the number
of blips in the interval [0,t). More generally for s<t,
N(t)-N(s) is also an integer random variable whose value is
the number of blips in the interval [s,t).

(2) $P(N(t) - N(s) = k) = \dfrac{(\alpha(t-s))^k}{k!}\,e^{-\alpha(t-s)}$, if $0 \le s < t$.

That is, on any subinterval [s,t) of $[0,\infty)$, the number of
blips occurring has a Poisson distribution. Notice that we
assume the density of the blips is independent of the
location of the subinterval. In particular,

\[ P(N(t) = k) = \frac{(\alpha t)^k}{k!}\,e^{-\alpha t}. \]

(3) If $[s_1,t_1)$ and $[s_2,t_2)$ are disjoint subintervals of $[0,\infty)$,
then $N(t_1)-N(s_1)$ and $N(t_2)-N(s_2)$ are independent random
variables. In other words, what happens on one subinterval is
independent of what happens on any subinterval disjoint from it.
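
Assumptions (2) and (3) can be illustrated by simulation. The sketch below generates blips by adding independent exponential gaps with parameter $\alpha$, which is the alternative description of the process mentioned at the start of section 2; that this construction satisfies the three assumptions is taken here as an assumption, not proved. It then estimates the mean number of blips in two disjoint intervals and the covariance of the two counts. All names, the seed and the numerical values are ours.

```python
import random


def rare_sequence(alpha, horizon, rng):
    """Blips on [0, horizon) generated by adding independent exponential gaps."""
    blips, t = [], rng.expovariate(alpha)
    while t < horizon:
        blips.append(t)
        t += rng.expovariate(alpha)
    return blips


def count(blips, s, t):
    """N(t) - N(s): the number of blips in [s, t)."""
    return sum(s <= x < t for x in blips)


def increment_stats(alpha, i1, i2, trials=40_000, seed=3):
    """Sample means of the counts on two intervals and their sample covariance."""
    rng = random.Random(seed)
    horizon = max(i1[1], i2[1])
    pairs = []
    for _ in range(trials):
        blips = rare_sequence(alpha, horizon, rng)
        pairs.append((count(blips, *i1), count(blips, *i2)))
    m1 = sum(c1 for c1, _ in pairs) / trials
    m2 = sum(c2 for _, c2 in pairs) / trials
    cov = sum((c1 - m1) * (c2 - m2) for c1, c2 in pairs) / trials
    return m1, m2, cov


alpha = 2.0  # illustrative intensity
m1, m2, cov = increment_stats(alpha, (0.0, 1.0), (1.5, 3.0))
print(m1, alpha * 1.0)   # mean count on [0, 1) vs. alpha * length
print(m2, alpha * 1.5)   # mean count on [1.5, 3) vs. alpha * length
print(cov)               # close to 0 for disjoint intervals
```

A covariance near zero is only a necessary sign of independence, not a proof of it, but it contrasts with the clearly negative covariance found for the uniform process.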

The above three fundamental assumptions implicitly de-
fine the events of $\Omega$. Assumption (1) implies that (N(t)-N(s) = k)
is an event of $\Omega$ for all s < t and all k. We had a different notation
for this event in section 1.4:

\[ \begin{bmatrix} s,t \\ k \end{bmatrix}
   = \{\text{all rare sequences having } k \text{ blips in the interval } [s,t)\}
   = (N(t)-N(s) = k). \]

The above fundamental assumptions may be rewritten in terms
of events as follows.

(1) The subsets $\begin{bmatrix} s,t \\ k \end{bmatrix}$ of $\Omega$ are the elementary Poisson
events. An arbitrary event of $\Omega$ is obtained from elementary
events by intersections, complements and unions.

(2) $P\left(\begin{bmatrix} s,t \\ k \end{bmatrix}\right) = \dfrac{(\alpha(t-s))^k}{k!}\,e^{-\alpha(t-s)}$.

(3) $\begin{bmatrix} s_1,t_1 \\ k_1 \end{bmatrix}$ and $\begin{bmatrix} s_2,t_2 \\ k_2 \end{bmatrix}$ are independent events
provided that $[s_1,t_1)$ and $[s_2,t_2)$ are disjoint intervals.

Unfortunately there is a problem with the definition 
of P above. How do we know that conditions (1), (2) and (3) 
do not imply some subtle contradiction? For the Bernoulli and 
the uniform processes it was quite obvious that our defini- 
tion gives a unique value for P (A) no matter how the event A 
is written in terms of elementary events. Although we have
given many reasons for believing that the Poisson process
should be well-defined, we haven't proved it yet. [If you 
are willing to believe that the Poisson process is well- 
defined, you can skip lightly over the rest of this section.] 

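For example, consider the event $\begin{bmatrix} s,t \\ 0 \end{bmatrix}^c$; i.e., the
event that one or more blips occur in the interval [s,t).
We could also write this as:

\[ \begin{bmatrix} s,t \\ 0 \end{bmatrix}^c
   = \begin{bmatrix} s,t \\ 1 \end{bmatrix} \cup \begin{bmatrix} s,t \\ 2 \end{bmatrix} \cup \cdots \]

The probability of the event on the left hand side above is:

\[ P\left(\begin{bmatrix} s,t \\ 0 \end{bmatrix}^c\right)
   = 1 - P\left(\begin{bmatrix} s,t \\ 0 \end{bmatrix}\right)
   = 1 - e^{-\alpha(t-s)}. \]

On the other hand, because the events
$\begin{bmatrix} s,t \\ 1 \end{bmatrix}, \begin{bmatrix} s,t \\ 2 \end{bmatrix}, \ldots$ are
disjoint, the probability of the right hand side is the fol-
lowing. [Recall that the Taylor series expansion of $e^{\alpha(t-s)}$ is
$1 + \alpha(t-s) + \frac{(\alpha(t-s))^2}{2!} + \cdots$.]

\[ \begin{aligned}
P\left(\begin{bmatrix} s,t \\ 1 \end{bmatrix}\right) + P\left(\begin{bmatrix} s,t \\ 2 \end{bmatrix}\right) + \cdots
&= \alpha(t-s)\,e^{-\alpha(t-s)} + \frac{(\alpha(t-s))^2}{2!}\,e^{-\alpha(t-s)} + \cdots \\
&= \left(e^{\alpha(t-s)} - 1\right)e^{-\alpha(t-s)} \\
&= 1 - e^{-\alpha(t-s)}.
\end{aligned} \]

So we get the same answer either way.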

As another example, if r<s<t, then k blips occur in 
[r,t) if and only if some occur in [r,s) and the rest occur 
in [s,t). In symbols: 



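$$\begin{bmatrix} r,t \\ k \end{bmatrix} = \bigcup_{\ell=0}^{k} \left( \begin{bmatrix} r,s \\ \ell \end{bmatrix} \cap \begin{bmatrix} s,t \\ k-\ell \end{bmatrix} \right)$$

We can therefore compute $P\left(\begin{bmatrix} r,t \\ k \end{bmatrix}\right)$ in two ways. The first way
is:

$$P\left(\begin{bmatrix} r,t \\ k \end{bmatrix}\right) = \frac{(\alpha(t-r))^k}{k!} e^{-\alpha(t-r)}.$$

The second way is the following. It is quite complicated
and requires all of our assumptions on P.

$$P\left(\begin{bmatrix} r,t \\ k \end{bmatrix}\right) = P\left(\bigcup_{\ell=0}^{k} \begin{bmatrix} r,s \\ \ell \end{bmatrix} \cap \begin{bmatrix} s,t \\ k-\ell \end{bmatrix}\right)$$
$$= \sum_{\ell=0}^{k} P\left(\begin{bmatrix} r,s \\ \ell \end{bmatrix} \cap \begin{bmatrix} s,t \\ k-\ell \end{bmatrix}\right)$$
$$= \sum_{\ell=0}^{k} P\left(\begin{bmatrix} r,s \\ \ell \end{bmatrix}\right) P\left(\begin{bmatrix} s,t \\ k-\ell \end{bmatrix}\right)$$
$$= \sum_{\ell=0}^{k} \frac{(\alpha(s-r))^\ell}{\ell!} e^{-\alpha(s-r)} \frac{(\alpha(t-s))^{k-\ell}}{(k-\ell)!} e^{-\alpha(t-s)}$$
$$= \alpha^k \left( \sum_{\ell=0}^{k} \frac{1}{\ell!(k-\ell)!} (s-r)^\ell (t-s)^{k-\ell} \right) e^{-\alpha(t-r)}$$
$$= \frac{\alpha^k}{k!} \left( \sum_{\ell=0}^{k} \binom{k}{\ell} (s-r)^\ell (t-s)^{k-\ell} \right) e^{-\alpha(t-r)}$$
$$= \frac{\alpha^k}{k!} \left( (s-r) + (t-s) \right)^k e^{-\alpha(t-r)}$$
$$= \frac{\alpha^k (t-r)^k}{k!} e^{-\alpha(t-r)}.$$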



Notice our use of the Binomial formula. In any case, we again 
get the same answer either way. 

Are there other relations among the events $\begin{bmatrix} s,t \\ k \end{bmatrix}$ not

obtainable from the above two examples? The answer is no, 
but this is not easy to prove: we leave this as an exercise. 
In any case we have now shown that P is consistently defined. 
One can in fact show that in some sense every possible proba- 
bility on the Poisson sample space is a perturbation of the 
probability P we have just defined (for example, a could 
vary in time) . 

Sums of Independent Poisson Random Variables

The Poisson process has an important property we
now discuss. Imagine that we have two independent
Poisson processes of intensities α and β. To distinguish
them we color the blips of the first process
red and the blips of the second process blue. Now suppose
we are color-blind. What we will see is a
Poisson process with intensity α + β.

[Figure: Two independent Poisson processes; x = red, • = blue]

Let N_red(t) and N_blue(t) be the random functions of the
red and blue processes respectively. We are saying that
N_red(t) + N_blue(t) has Poisson distribution with
parameter (α+β)t.

Let's prove this. Since we assumed N_red(t) and
N_blue(t) are independent, the distribution of
N_red(t) + N_blue(t) is the convolution of their individual
distributions.

$$P(N_{red}(t)+N_{blue}(t)=n) = \sum_{k=0}^{n} P(N_{red}(t)=k)\,P(N_{blue}(t)=n-k)$$
$$= \sum_{k=0}^{n} \frac{(\alpha t)^k}{k!} e^{-\alpha t} \frac{(\beta t)^{n-k}}{(n-k)!} e^{-\beta t}$$
$$= \frac{t^n}{n!} e^{-(\alpha+\beta)t} \sum_{k=0}^{n} \frac{n!}{k!(n-k)!} \alpha^k \beta^{n-k}$$
$$= e^{-(\alpha+\beta)t} \frac{t^n (\alpha+\beta)^n}{n!}$$
$$= \frac{((\alpha+\beta)t)^n}{n!} e^{-(\alpha+\beta)t}.$$

The last expression is the Poisson distribution with
parameter (α+β)t. Thus the sum of independent Poisson
random variables is again Poisson. This is an important 
property of the Poisson process, and we will find some very 
deep applications of it in later sections. 
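
The convolution identity above can be checked numerically. The following is a minimal sketch, not part of the original text; the function names and the values chosen for the two intensities and for t are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return lam**k / factorial(k) * exp(-lam)

def convolved_pmf(n, lam1, lam2):
    """P(X + Y = n) for independent Poisson X and Y, as a convolution sum."""
    return sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2)
               for k in range(n + 1))

# Illustrative (assumed) intensities and time.
alpha, beta, t = 1.5, 0.7, 2.0
for n in range(6):
    direct = poisson_pmf(n, (alpha + beta) * t)
    via_sum = convolved_pmf(n, alpha * t, beta * t)
    print(n, round(direct, 10), round(via_sum, 10))
```

The two printed columns should agree up to rounding, which is what the binomial-formula step in the proof asserts.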

Physical Systems and the Poisson Process 

We have already mentioned several examples of exponentially 
distributed random variables in section 1 . The Poisson 
process is a sequence of independent exponentially distributed 
random variables so we shouldn't be surprised at its ubiquity. 

Geiger Counters 

The first example that comes to mind immediately is the
sequence of clicks of a Geiger counter. If we are measuring
the radiation of a radioactive sample, the clicks are almost
the blips of a Poisson process. Of course, we know that
the intensity α will gradually decrease as the sample decays.
However, if we choose to measure time so that α becomes
constant, the Poisson process is an almost perfect model of
the physical system. Even if we measure time in the usual
units, the model is very close.

Quality Control

Suppose we have a continuous assembly process (say of a 
rope or wire) and that this process occasionally produces 
tiny defects randomly on the rope. If we know the length of 
the rope and the number of defects, then the uniform process 
is a good model of this system. On the other hand if we know 
only the average number of defects per unit length (from prior 
experience) , then the Poisson process is a better model. Even 
if we also know the length of the rope to be produced, the Poisson
process is the better model.

One can use such models for Quality Control. If a 
long length of rope is being produced, one can sample portions 
of the rope to determine if the number of defects per unit 
length is exceeding a specified level of acceptance, as 
might happen if an assembly machine is out of adjustment. 

For example, if the average density of the defects on
the rope is 1/10 defect/foot, then the probability of no
defects on a rope of length 10 feet is $e^{-\frac{1}{10}\cdot 10} = e^{-1}$. The
probability of exactly two defects is $\frac{1^2}{2!} e^{-1} = \frac{1}{2} e^{-1}$.
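
As a quick numerical illustration of this kind of quality-control computation, here is a small sketch. It assumes a defect density of 1/10 per foot and a rope of 10 feet, so that the Poisson parameter is 1; the variable names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(N = k) when N has the Poisson distribution with parameter lam."""
    return lam**k / factorial(k) * exp(-lam)

density = 1 / 10        # assumed average defects per foot
length = 10             # assumed rope length in feet
lam = density * length  # Poisson parameter for the whole rope

p_none = poisson_pmf(0, lam)  # no defects on the rope
p_two = poisson_pmf(2, lam)   # exactly two defects
print(p_none, p_two)
```

Changing `density` or `length` shows how quickly the probability of a defect-free rope drops as the expected number of defects grows.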

Blips from Space 

Suppose one is aiming a radio telescope toward one di- 
rection in the sky. The signal one is receiving is a se- 
quence of irregularly spaced radio bursts or "blips". Is 
the signal simply noise or is it a broadcast from some station? 
By comparing statistical properties of the blips with the 
known properties of the Poisson process one can distinguish 
random noise from a broadcast signal with a certain
probability of error.

Seeds on a Cornfield 

There is a two-dimensional model analogous to the Poisson
process. For a region A in the plane, let μ(A) be the
area of A. We replace the random function N(t) by a random
function N(A) = number of blips in the region A. The
fundamental assumptions on N(A) are:

(1) For every region A, N(A) is an integer random variable,
the number of blips occurring in the region.

(2) $P(N(A) = k) = \frac{(\alpha\mu(A))^k}{k!} e^{-\alpha\mu(A)}$, i.e. N(A) has the
Poisson distribution with parameter αμ(A).

(3) If A and B are disjoint regions, then N(A) and N(B) are
independent random variables. [Of course, we could produce
a model of this kind in any number of dimensions.]

An example of such a process is the process of sprinkling
seeds from an airplane randomly onto a field with some intensity
α, the average number of seeds per unit area. As another
example, we might have stars spread randomly throughout a
large volume of space with some average density α of stars
per unit volume.

These are just a small selection of an enormous range of 
examples of the Poisson process occurring in nature. In 
fact it is the most common of the four basic stochastic
processes. We shall now see how the Poisson process can enrich
our understanding of the first two stochastic processes;
moreover it will give us a powerful tool for computing
probability distributions in these processes.

Gaps and Waiting Times

We must now check that in our model the distributions
of the gaps and the waiting times correspond to our earlier
computations. Consider first the waiting time W_k for the k-th
blip. (W_k > t) means that the k-th blip has not yet occurred
by time t, or equivalently that k-1 or fewer blips have
occurred in the interval [0,t). In terms of N(t):

$$(W_k > t) = (N(t)=0) \cup (N(t)=1) \cup \cdots \cup (N(t)=k-1).$$

The events of the right hand side being disjoint, we may
compute:

$$P(W_k > t) = P(N(t)=0) + P(N(t)=1) + \cdots + P(N(t)=k-1)$$
$$= e^{-\alpha t} + \alpha t\, e^{-\alpha t} + \frac{(\alpha t)^2}{2!} e^{-\alpha t} + \cdots + \frac{(\alpha t)^{k-1}}{(k-1)!} e^{-\alpha t}.$$

The density of W_k is the derivative $\frac{d}{dt} P(W_k \le t) = \frac{d}{dt}(1 - P(W_k > t))$.
Since P(W_k = t) = 0,

$$\text{dens}(W_k = t) = -\frac{d}{dt} P(W_k > t) = -\frac{d}{dt}\left( e^{-\alpha t} + \alpha t\, e^{-\alpha t} + \cdots + \frac{(\alpha t)^{k-1}}{(k-1)!} e^{-\alpha t} \right).$$

When we differentiate the latter expression, each term
except for the first gives rise to two terms:

$$-\text{dens}(W_k = t) = (-\alpha e^{-\alpha t}) + (\alpha e^{-\alpha t} - \alpha^2 t\, e^{-\alpha t}) + \left(\alpha^2 t\, e^{-\alpha t} - \frac{\alpha^3 t^2}{2!} e^{-\alpha t}\right) + \cdots + \left(\frac{\alpha^{k-1} t^{k-2}}{(k-2)!} e^{-\alpha t} - \frac{\alpha^k t^{k-1}}{(k-1)!} e^{-\alpha t}\right).$$



All the terms then cancel except for the last one so that

$$\text{dens}(W_k = t) = \frac{\alpha^k t^{k-1}}{(k-1)!} e^{-\alpha t},$$

which is the gamma distribution. In particular, the first
waiting time (the first gap) is exponentially distributed.

To show that the gaps are independent and exponentially 
distributed with parameter a we use the continuous law of 
successive conditioning and the continuous law of alternatives 
in exactly the same way that we used the ordinary law of 
successive conditioning and the ordinary law of alternatives 
to compute the distributions of the gaps of the Bernoulli 
process in section V.2. The verification is left as an ex- 
ercise. 
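
The gamma density for the k-th waiting time can be sanity-checked by differentiating P(W_k > t) numerically. The sketch below is an illustration only; the central-difference step `h` and the function names are assumptions, not part of the text.

```python
from math import exp, factorial

def survival(k, t, alpha):
    """P(W_k > t) = P(N(t) <= k - 1) in a Poisson process of intensity alpha."""
    return sum((alpha * t)**j / factorial(j) * exp(-alpha * t) for j in range(k))

def gamma_density(k, t, alpha):
    """Closed form alpha^k t^(k-1) e^(-alpha t) / (k-1)!."""
    return alpha**k * t**(k - 1) / factorial(k - 1) * exp(-alpha * t)

def numeric_density(k, t, alpha, h=1e-5):
    """Central-difference approximation to -d/dt P(W_k > t)."""
    return -(survival(k, t + h, alpha) - survival(k, t - h, alpha)) / (2 * h)

print(numeric_density(3, 1.2, 2.0), gamma_density(3, 1.2, 2.0))
```

The two printed numbers should agree to several decimal places, reflecting the telescoping cancellation in the derivation.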

The Uniform Process from the Poisson Process 

We built the model of the Poisson process by thinking 
of what happens to the uniform process as the length of 
the interval gets larger. We can turn this around; by 
conditioning the Poisson process to have exactly n blips 
in the interval [0,a], we get the uniform process. 

To see this we compute the conditional probability

$$P(N(t) = k \mid N(a) = n) = \frac{P((N(t)=k) \cap (N(a)=n))}{P(N(a)=n)}.$$
Now (N(a)=n) ∩ (N(t)=k) says that n blips occur in [0,a) and
that k of them occur in [0,t). In other words k blips occur
in [0,t) and n-k occur in [t,a). These are disjoint
intervals; therefore

$$P(N(t)=k \mid N(a)=n) = \frac{P((N(t)=k) \cap (N(a)-N(t)=n-k))}{P(N(a)=n)}$$
$$= \frac{P(N(t)=k)\,P(N(a)-N(t)=n-k)}{P(N(a)=n)}$$
$$= \frac{\dfrac{(\alpha t)^k}{k!} e^{-\alpha t} \cdot \dfrac{(\alpha(a-t))^{n-k}}{(n-k)!} e^{-\alpha(a-t)}}{\dfrac{(\alpha a)^n}{n!} e^{-\alpha a}}$$
$$= \frac{n!}{k!(n-k)!} \cdot \frac{t^k (a-t)^{n-k}}{a^n}.$$

We recognize this as the distribution of U_{n,a}(t). Since we
can express any computation about order statistics in terms
of the random function U_{n,a}(t), we can in principle compute
anything about order statistics by conditioning the Poisson
process.
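
To see concretely that the intensity drops out of the conditional distribution, one can compare it with the binomial distribution for several intensities. This is a minimal sketch with illustrative parameter values; the function names are hypothetical.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """Poisson probability with parameter lam."""
    return lam**k / factorial(k) * exp(-lam)

def conditional(k, n, t, a, alpha):
    """P(N(t)=k | N(a)=n), using independence on [0,t) and [t,a)."""
    joint = poisson_pmf(k, alpha * t) * poisson_pmf(n - k, alpha * (a - t))
    return joint / poisson_pmf(n, alpha * a)

def binomial(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The same binomial value appears for every intensity tried.
for alpha in (0.3, 2.0, 7.5):
    print(alpha, conditional(2, 5, 1.0, 4.0, alpha), binomial(2, 5, 1.0 / 4.0))
```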

For example, we can compute the densities of the order 
statistics without using a limit argument as we did in section ULS. 

$$\text{dens}(X_{(k)} = t) = \text{dens}(W_k = t \mid N(a) = n) = \frac{\text{dens}((W_k = t) \cap (N(a)=n))}{P(N(a)=n)}.$$
If you feel uneasy about the use of a mixture of a density
and a probability in the above computation just rewrite it as

$$\text{dens}((W_k = t) \cap (N(a)=n)) = \frac{d}{dt} P((W_k \le t) \cap (N(a)=n)).$$

The event (W_k ≤ t) is the same as saying at least k blips occur
in [0,t), i.e. (W_k ≤ t) = (N(t)=k) ∪ (N(t)=k+1) ∪ ... . On the
other hand, if we know that the k-th blip has occurred at
time t, then (N(a)=n) and (N(a)-N(t)=n-k) are the same
event. Therefore

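$$\text{dens}(X_{(k)} = t) = \frac{\text{dens}((W_k = t) \cap (N(a)-N(t)=n-k))}{P(N(a)=n)}$$
$$= \frac{\text{dens}(W_k = t)\,P(N(a)-N(t)=n-k)}{P(N(a)=n)}$$
$$= \frac{\dfrac{\alpha^k t^{k-1}}{(k-1)!} e^{-\alpha t} \cdot \dfrac{\alpha^{n-k}(a-t)^{n-k}}{(n-k)!} e^{-\alpha(a-t)}}{\dfrac{\alpha^n a^n}{n!} e^{-\alpha a}}$$
$$= \frac{n!}{(k-1)!\,(n-k)!} \cdot \frac{t^{k-1}(a-t)^{n-k}}{a^n}$$
$$= \frac{n}{a} \binom{n-1}{k-1} \left(\frac{t}{a}\right)^{k-1} \left(1 - \frac{t}{a}\right)^{n-k}.$$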

Notice that when a Poisson process is conditioned 
by (N(a) = n) , the result is always the uniform process 
of sampling n points from [0,a] no matter what the 
intensity was in the original Poisson process. This is 
further confirmation that the Poisson process is the 
process of sprinkling points at random on [0,∞).



4. The Schrödinger Method

One of the most striking applications of the Poisson 
process is to the discrete problem of counting the number 
of ways of putting balls into boxes. We consider the 
problem in full generality. That is, we want a technique 
whereby if we are given k balls and n boxes and if we 
are given any restrictions whatsoever on the occupation 
numbers, then we can compute how many ways this can be 
done. For example, one might restrict each box to contain 
zero, one or two balls. For small k and n, one could 
exhaustively enumerate the possibilities. But for k 
and n even as small as 10, this is already a very non- 
trivial problem. As another example, suppose we require 
that if the third box has an odd number of balls then the 
fifth box has a multiple of seven balls in it. We need 
a very systematic procedure if we are to give a 
reasonable solution to such a counting problem. The technique
we will develop is due to Schrödinger, and we call
it the randomization technique for reasons we will see in 
the next section. 

We begin with a formula from calculus whose signifi- 
cance is seldom made very clear: Taylor's formula. The 
difficulty stems from a common misconception that one is 
supposed to use this formula to compute the Taylor expan- 
sion of a function. Although in principle this is possible, 
this is quite misleading. In fact one usually computes
the Taylor expansion by some other technique (and there
are many such techniques). One uses the Taylor formula
to compute the values of the derivatives of the function
at 0 rather than the other way around!

Taylor's formula. Every function f that can be
differentiated infinitely many times at 0 has a
unique power series expansion, called the Taylor
expansion of f at 0:

$$f(0) + f'(0)x + \frac{1}{2} f''(0)x^2 + \cdots + \frac{1}{n!} f^{(n)}(0)x^n + \cdots$$

The notation $f^{(n)}(0)$ is an abbreviation for

$$\left[\frac{d^n}{dx^n} f(x)\right]_{x=0}.$$

For example, if one wishes to compute the Taylor
expansion of $f(x) = \sqrt{1+x}$, one should not start
differentiating repeatedly. The best way is to use the
binomial expansion:

$$(1+x)^\alpha = 1 + \binom{\alpha}{1}x + \binom{\alpha}{2}x^2 + \cdots,$$

where the binomial coefficient $\binom{\alpha}{k} = \frac{(\alpha)_k}{k!}$ makes sense
for any number α as we already noted in section III.6.
So for example

$$f(x) = \sqrt{1+x} = 1 + \binom{1/2}{1}x + \binom{1/2}{2}x^2 + \cdots,$$

and by Taylor's formula we see that

$$\frac{1}{3!} f'''(0) = \binom{1/2}{3} = \frac{(1/2)(-1/2)(-3/2)}{3!} = \frac{1}{16}.$$
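
The idea of reading derivatives off a known expansion can be illustrated with exact rational arithmetic. The sketch below computes generalized binomial coefficients for the exponent 1/2; the helper name `gen_binom` is a hypothetical choice made for this illustration.

```python
from fractions import Fraction
from math import factorial

def gen_binom(a, k):
    """Generalized binomial coefficient a(a-1)...(a-k+1)/k! for rational a."""
    numerator = Fraction(1)
    for j in range(k):
        numerator *= Fraction(a) - j
    return numerator / factorial(k)

half = Fraction(1, 2)
# Coefficients of 1, x, x^2, x^3, x^4 in the expansion of sqrt(1 + x).
coeffs = [gen_binom(half, k) for k in range(5)]
# By Taylor's formula the third derivative at 0 is 3! times the x^3 coefficient.
third_derivative = factorial(3) * coeffs[3]
print(coeffs, third_derivative)
```

Here the coefficient of x^3 comes out as 1/16, so the third derivative of the square root of 1 + x at 0 is 3/8, with no differentiation performed.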
Suppose that we are given some conditions on the
occupation numbers of n boxes. The number of ways we
can place k balls into the n boxes subject to these
conditions is n^k P(B_k), where B_k is the event:

B_k = "a placement of k balls into n
boxes satisfies the given conditions
on the occupation numbers."

This event is a subset of the sample space Ω of all
placements of k balls into n boxes, each being equally
likely.

To compute P(B_k) we construct "physical" boxes from
the interval [0,1] by cutting it into n subintervals
each of length 1/n. Then P(B_k) is the probability
that k points dropped at random uniformly on [0,1] will
satisfy the given conditions on the occupation numbers.

[Figure: 6 boxes made from [0,1], with cut points at 1/6, 1/3, 1/2, 2/3, 5/6]

This converts any balls-into-boxes problem into a
computation of the probability of a certain event of the
uniform process.

Now comes the important step. We ask a different
question. What is the probability that for a rare sequence
of blips on [0,∞) those blips falling in [0,1] satisfy
the given conditions? Call this event A. At first it
appears that the computation of P(A) is a much harder
problem, but we shall see that it is actually much easier.
Once we know P(A), we still do not yet know P(B_k) because
the events are in completely different processes.

However, we just saw that whenever we condition the Poisson
process by the event (N(a) = k), the result is a uniform
process. Therefore,

$$P(B_k) = P(A \mid N(1) = k),$$

and this holds no matter what the intensity α of the
original Poisson process. The relationship between P(A)
and the conditional probabilities P(A | N(1) = k) is given by
the law of alternatives:

$$P(A) = \sum_{k=0}^{\infty} P(A \mid N(1) = k)\,P(N(1) = k) = \sum_{k=0}^{\infty} P(B_k) \frac{\alpha^k}{k!} e^{-\alpha}.$$

If we multiply both sides of this equation by $e^{\alpha}$, we get
the important formula:

$$P(A)\,e^{\alpha} = \sum_{k=0}^{\infty} P(B_k) \frac{\alpha^k}{k!}.$$

The left hand side is a function f(α) of the variable α
because A is an event of the Poisson process, which depends
on the intensity α. The probabilities P(B_k), on the other
hand, do not depend on α. Therefore, by the Taylor formula,

$$P(B_k) = \left[\frac{d^k}{d\alpha^k} f(\alpha)\right]_{\alpha=0}$$

or

$$P(B_k) = \left[\frac{d^k}{d\alpha^k} \left(P(A)\,e^{\alpha}\right)\right]_{\alpha=0}.$$



Consider the following example. Suppose that the
conditions on the occupation numbers are that the occupation
number of box i be zero or one for every box i. We computed
P(B_k) for this case in section III.2: $P(B_k) = \frac{(n)_k}{n^k}$.
To compute this using the randomization technique we must
first compute P(A), where A is the event "either no blips or
just one blip occur in each of the n subintervals of length
1/n of [0,1]." If we write A_i for the event "box i (i.e. the
subinterval [(i-1)/n, i/n) of [0,1]) has either no blips or
just one blip," then

$$A = A_1 \cap A_2 \cap \cdots \cap A_n.$$

In terms of elementary Poisson events,

$$A_i = \begin{bmatrix} \frac{i-1}{n}, \frac{i}{n} \\ 0 \end{bmatrix} \cup \begin{bmatrix} \frac{i-1}{n}, \frac{i}{n} \\ 1 \end{bmatrix} = \left(0 \le N\left(\tfrac{i}{n}\right) - N\left(\tfrac{i-1}{n}\right) \le 1\right).$$

So

$$P(A_i) = e^{-\alpha/n} + \frac{\alpha}{n} e^{-\alpha/n} = \left(1 + \frac{\alpha}{n}\right) e^{-\alpha/n}.$$


Now comes the crucial step. In the Poisson process, the
numbers of blips in disjoint intervals are independent of one
another. This is not true for the uniform process, and it
is not true for occupation numbers of balls into boxes. It
is this fact about the Poisson process that makes this
technique so effective. P(A) is easy to compute because
the computation of P(A_i) is a routine application of the
definition of the Poisson process, and P(A) is the product
P(A_1)P(A_2)···P(A_n).



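$$P(A) = \underbrace{\left(1 + \frac{\alpha}{n}\right)e^{-\alpha/n} \left(1 + \frac{\alpha}{n}\right)e^{-\alpha/n} \cdots \left(1 + \frac{\alpha}{n}\right)e^{-\alpha/n}}_{n \text{ factors}} = \left(1 + \frac{\alpha}{n}\right)^n e^{-\alpha}.$$

Therefore $P(A)\,e^{\alpha} = \left(1 + \frac{\alpha}{n}\right)^n$ and

$$P(B_k) = \left[\frac{d^k}{d\alpha^k} \left(1 + \frac{\alpha}{n}\right)^n\right]_{\alpha=0}.$$

The formula above is a perfectly legitimate answer to this
problem. We can, however, put this into a nicer form by first
expanding $\left(1 + \frac{\alpha}{n}\right)^n$ using the binomial formula and by
doing a little rearranging:

$$\left(1 + \frac{\alpha}{n}\right)^n = \sum_{k=0}^{n} \binom{n}{k} \left(\frac{\alpha}{n}\right)^k = \sum_{k=0}^{n} \frac{(n)_k}{k!} \frac{\alpha^k}{n^k} = \sum_{k=0}^{n} \frac{(n)_k}{n^k} \frac{\alpha^k}{k!}.$$

Therefore $P(B_k) = \frac{(n)_k}{n^k}$, exactly as we got previously.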



It appears that this is the harder way to compute this answer, 
but that is only because this example is special. In most 
cases this technique is considerably easier. 
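
For small n and k the result for "at most one ball per box" can be confirmed by listing every placement. This brute-force sketch is only an illustration; the function names are assumptions, and the enumeration is feasible only for small values.

```python
from fractions import Fraction
from itertools import product

def falling_factorial(n, k):
    """(n)_k = n(n-1)...(n-k+1)."""
    result = 1
    for j in range(k):
        result *= n - j
    return result

def brute_force(n, k):
    """Fraction of the n**k placements in which no box holds two or more balls."""
    good = sum(1 for placement in product(range(n), repeat=k)
               if len(set(placement)) == k)
    return Fraction(good, n**k)

print(brute_force(4, 3), Fraction(falling_factorial(4, 3), 4**3))
```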

As a harder example consider the event B_k = "every box
has at least one ball in it." We computed P(B_k) in section
IV.8, using the inclusion-exclusion principle. This was
quite an elaborate argument. To compute P(B_k) using
randomization, we first compute P(A), where A = "all n
boxes of [0,1] have at least one blip." As above, let
A_i = "box i has at least one blip." Then

$$A_i = \begin{bmatrix} \frac{i-1}{n}, \frac{i}{n} \\ 0 \end{bmatrix}^c$$

so that $P(A_i) = 1 - e^{-\alpha/n}$. Hence

$$P(A) = (1 - e^{-\alpha/n})^n \quad\text{and}\quad P(A)\,e^{\alpha} = (1 - e^{-\alpha/n})^n e^{\alpha} = (e^{\alpha/n} - 1)^n.$$

Therefore

$$P(B_k) = \left[\frac{d^k}{d\alpha^k} (e^{\alpha/n} - 1)^n\right]_{\alpha=0}.$$

Now we could leave the answer in this form, but we can derive
a better expression by using the binomial formula and the
well-known Taylor expansion of the exponential function.

$$(e^{\alpha/n} - 1)^n = \sum_{j=0}^{n} \binom{n}{j} (-1)^j (e^{\alpha/n})^{n-j}$$
$$= \sum_{j=0}^{n} \binom{n}{j} (-1)^j e^{(n-j)\alpha/n}$$
$$= \sum_{j=0}^{n} \binom{n}{j} (-1)^j \sum_{k=0}^{\infty} \left(\frac{n-j}{n}\right)^k \frac{\alpha^k}{k!}$$
$$= \sum_{j=0}^{n} \sum_{k=0}^{\infty} \binom{n}{j} (-1)^j \left(1 - \frac{j}{n}\right)^k \frac{\alpha^k}{k!}$$
$$= \sum_{k=0}^{\infty} \left( \sum_{j=0}^{n} \binom{n}{j} (-1)^j \left(1 - \frac{j}{n}\right)^k \right) \frac{\alpha^k}{k!}.$$

Therefore, $P(B_k) = \sum_{j=0}^{n} \binom{n}{j} (-1)^j \left(1 - \frac{j}{n}\right)^k$.
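
The alternating sum for "every box has at least one ball" can be compared with a direct enumeration for small cases. The following sketch uses exact fractions; the function names are hypothetical and the brute-force count is practical only for small n and k.

```python
from fractions import Fraction
from itertools import product
from math import comb

def formula(n, k):
    """Sum over j of C(n,j) (-1)^j (1 - j/n)^k, in exact arithmetic."""
    return sum(comb(n, j) * (-1)**j * Fraction(n - j, n)**k
               for j in range(n + 1))

def brute_force(n, k):
    """Fraction of the n**k placements in which every box is occupied."""
    good = sum(1 for placement in product(range(n), repeat=k)
               if len(set(placement)) == n)
    return Fraction(good, n**k)

print(formula(3, 4), brute_force(3, 4))
```

With fewer balls than boxes the sum collapses to zero, as it must.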



For another example, suppose we want every box to have 
at most two balls. By now one should be able to compute this 
immediately: 

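$$P(A) = \left( e^{-\alpha/n} + \frac{\alpha}{n} e^{-\alpha/n} + \frac{\alpha^2}{2n^2} e^{-\alpha/n} \right)^n$$

$$P(A)\,e^{\alpha} = \left( 1 + \frac{\alpha}{n} + \frac{\alpha^2}{2n^2} \right)^n$$

$$P(B_k) = \left[\frac{d^k}{d\alpha^k} \left( 1 + \frac{\alpha}{n} + \frac{\alpha^2}{2n^2} \right)^n\right]_{\alpha=0}$$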



Summary 

To compute P(B k ), where B k is the event "a placement 
of k balls into n boxes satisfies certain given condi- 
tions on the occupation numbers," we do the following: 
Let A be the event of the Poisson process that
"the rare sequence of blips on [0,∞) has the property
that the blips falling in [0,1] satisfy the given
conditions on the occupation numbers." Box i is now
the subinterval [(i-1)/n, i/n).

Compute P(A). This will be a function of α.
Generally this will not be difficult to compute because
what happens in disjoint intervals of the Poisson
process is independent.

Apply the formula:

$$P(B_k) = \left[\frac{d^k}{d\alpha^k} \left(P(A)\,e^{\alpha}\right)\right]_{\alpha=0}.$$

One can often apply some formulas such as the binomial
formula, to expand $P(A)\,e^{\alpha}$ thereby deriving another
expression for P(B_k).
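
The steps of the summary can be sketched in code by treating P(A) times e^α as a truncated power series in α with exact rational coefficients. This is an illustrative implementation under simplifying assumptions (the same rule applies to every box, and the rule is given as a hypothetical predicate `allowed`); it is not the authors' own procedure.

```python
from fractions import Fraction
from math import factorial

def box_series(allowed, n, order):
    """Power series in alpha of e^(alpha/n) * P(A_i), truncated at alpha**order.

    allowed(j) tells whether one box may contain exactly j blips; the
    coefficient of alpha**j is then 1 / (j! n**j).
    """
    return [Fraction(1, factorial(j) * n**j) if allowed(j) else Fraction(0)
            for j in range(order + 1)]

def multiply(p, q, order):
    """Product of two truncated power series."""
    out = [Fraction(0)] * (order + 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j <= order:
                out[i + j] += a * b
    return out

def prob_placement(allowed, n, k):
    """P(B_k): k! times the coefficient of alpha**k in P(A) e^alpha."""
    series = [Fraction(1)] + [Fraction(0)] * k
    one_box = box_series(allowed, n, k)
    for _ in range(n):  # independence: multiply the n per-box series
        series = multiply(series, one_box, k)
    return factorial(k) * series[k]

print(prob_placement(lambda j: j <= 1, 4, 3))  # at most one ball per box
print(prob_placement(lambda j: j >= 1, 3, 4))  # every box occupied
print(prob_placement(lambda j: j <= 2, 2, 3))  # at most two balls per box
```

Taking the k-th derivative at 0 is replaced here by reading off a series coefficient, which is the use of Taylor's formula described earlier in the section.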

5. Randomized and Compound Processes

Randomization is a general term for the method whereby 
new stochastic processes are created by allowing a para- 
meter of a given stochastic process to be chosen randomly 
according to some distribution. 

Randomized Uniform Process 

For example, consider the uniform process of sampling 
n points from the interval [0,a] . Suppose that instead 
of sampling a fixed number n of points we sample a random 
number N of points. That is, we consider the two-step 
process : 

1) Choose a number N of points to be sampled,
according to some probability distribution:

P(N = n) = p_n.

2) Once N is known, sample that many points
from [0,a] according to the uniform process.

We have replaced the fixed number n of points by the integer
random variable N, having probability distribution
P(N = n) = p_n. The result is a new stochastic process, the
randomized uniform process.

Every question we have asked about the ordinary uniform 
process can now be asked for the randomized process. To 
compute the answers in this new process, we use the law of 
alternatives. For example, we might ask what the probability
is for exactly k points to be in the interval [0,t].
Call this event A_{k,t}. We want to compute P(A_{k,t}). By the
law of alternatives,

$$P(A_{k,t}) = \sum_{n=0}^{\infty} P(A_{k,t} \mid N = n)\,P(N = n).$$

Now P(A_{k,t} | N = n) is the probability that exactly k points
are in [0,t] but in the ordinary uniform process of sampling
a fixed number n of points from [0,a]. We computed
this probability back in section VI.1, where we denoted it
by P(U_{n,a}(t) = k).

[Figure: the interval [0,a] divided at t, with k points in [0,t] and n-k points in [t,a]]

$$P(A_{k,t} \mid N = n) = P(U_{n,a}(t) = k) = \binom{n}{k} \left(\frac{t}{a}\right)^k \left(1 - \frac{t}{a}\right)^{n-k}.$$

Therefore the unconditional probability P(A_{k,t}) in the
randomized uniform process is

$$P(A_{k,t}) = \sum_{n=0}^{\infty} \binom{n}{k} \left(\frac{t}{a}\right)^k \left(1 - \frac{t}{a}\right)^{n-k} p_n.$$

If the probabilities p_n have a nice form, then it may be
possible to simplify this expression, but normally the
answer to a question about a randomized process will be in
the form of an infinite series.
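
One case where the series does simplify is when N itself has a Poisson distribution: the number of points in [0,t] is then again Poisson. The sketch below checks this numerically by truncating the series; the truncation point `n_max` and the parameter values are assumptions made for the illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(n, lam):
    """Poisson probability with parameter lam."""
    return lam**n / factorial(n) * exp(-lam)

def randomized_prob(k, t, a, p, n_max=150):
    """Truncated series: sum over n of C(n,k) (t/a)^k (1-t/a)^(n-k) p(n)."""
    return sum(comb(n, k) * (t / a)**k * (1 - t / a)**(n - k) * p(n)
               for n in range(k, n_max + 1))

alpha, a, t = 3.0, 4.0, 1.0  # illustrative (assumed) values
series_value = randomized_prob(2, t, a, lambda n: poisson_pmf(n, alpha * a))
print(series_value, poisson_pmf(2, alpha * t))
```

With N Poisson of parameter αa, the two printed values agree, which matches the earlier observation that conditioning a Poisson process on N(a) = n gives the uniform process.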

There are two reasons why one would randomize a 
stochastic process. The first is that it allows one to 
produce more general models of phenomena, which can be more 
realistic reflections of the phenomena being studied. We 
shall see examples of these in the exercises. Perhaps more 
important is the second reason: randomization can be used 
as a very powerful and effective computational tool. In 
fact this is one of the most important uses of probability 
theory. Problems that cannot be solved by direct means can 
be solved by allowing certain parameters to be random vari- 
ables. The technique of the last section is just one of 
these . 

Consider another example in the randomized uniform
process. Let B_t be the event "in the process of sampling
N points uniformly from [0,1], all the points appear in
[0,t]." Then

$$P(B_t) = \sum_{n=0}^{\infty} P(B_t \mid N = n)\,P(N = n),$$

by the law of alternatives. Now P(B_t | N = n) is the
probability that all the points of the ordinary uniform process
of sampling n points from [0,1] occur in [0,t]. Therefore,
P(B_t | N = n) = t^n, since we have chosen the length of the
interval to be 1. Therefore,

$$P(B_t) = \sum_{n=0}^{\infty} t^n p_n.$$

This is a function of t that is usually called the
generating function of the sequence {p_n}.

The technique of generating functions is important
both in probability theory and in other branches of
mathematics. Unfortunately it is usually defined by fiat with
little motivation beyond saying that it is useful. Using
probability theory we see more intuitively how it arises. Namely,
given a sequence {p_n} forming a probability distribution,
the generating function $f(t) = \sum_{n=0}^{\infty} p_n t^n$ of the sequence
is the probability that a random number of points, the number
chosen according to the distribution p_n, when sampled from
the unit interval, all occur in the subinterval [0,t]. In
other words, the generating function is a way of studying
a sequence {p_n} by setting up a certain experiment using
the sequence {p_n} and by studying the properties of this
experiment. This is the underlying reason why this
technique turns out to be so useful.
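
As a small numerical illustration, the generating function of a Poisson distribution with parameter λ has the known closed form exp(λ(t-1)). The sketch below evaluates the truncated series and compares; the parameter value and the cutoff `n_max` are assumptions chosen for the example.

```python
from math import exp, factorial

def generating_function(p, t, n_max=100):
    """Truncated series f(t) = sum over n of p(n) t^n."""
    return sum(p(n) * t**n for n in range(n_max + 1))

lam = 2.5  # illustrative Poisson parameter

def poisson(n):
    """p_n for the Poisson distribution with parameter lam."""
    return lam**n / factorial(n) * exp(-lam)

# Probability that all of a Poisson number of points land in [0, 0.4].
print(generating_function(poisson, 0.4), exp(lam * (0.4 - 1)))
```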

Randomized Poisson Process 

Now consider the Poisson process. Since this process
may be regarded as being the uniform process as n and a
approach infinity but with α = n/a fixed, we see that the
intensity α is the analog in the Poisson process of the
number of points sampled in the uniform process. The
randomized Poisson process is a Poisson process but with a
random intensity A (capital alpha) instead of a fixed
intensity. More precisely, this process is again a two-step
process:

1) Choose an intensity α according to the
density function g(α) = dens(A = α) of
the positive continuous random variable A.

2) Observe a rare sequence of blips in the
Poisson process having the chosen intensity.

Let T be the waiting time for the first blip in the
randomized Poisson process. To compute the distribution of
T, we use the law of alternatives but this time the
continuous version.

$$P(T > t) = \int_0^{\infty} P(T > t \mid A = \alpha)\,\text{dens}(A = \alpha)\,d\alpha.$$

The conditional probability P(T > t | A = α) is computed in
the ordinary Poisson process with intensity α. Therefore

$$P(T > t) = \int_0^{\infty} e^{-\alpha t} g(\alpha)\,d\alpha.$$

The function $\hat{g}(t) = \int_0^{\infty} e^{-\alpha t} g(\alpha)\,d\alpha$ is called the Laplace
transform of the function g(α).

The Laplace transform is an important technique in
engineering and in the sciences as well as in mathematics.
We now see why. If we are given a function g(α) forming
the probability density of a positive random variable, we
can study g(α) by setting up an experiment and then
studying the properties of the experiment. The experiment
consists of waiting for the first blip of a Poisson process
whose intensity is chosen according to the density g(α).
The probability distribution of this experiment is $1 - \hat{g}(t)$,
where $\hat{g}(t)$ is the Laplace transform of g(α).

As a simple example of this point of view, we can
explain an important property of the Laplace transform: the
Laplace transform of the convolution of functions is the
product of their Laplace transforms:

$$\widehat{f * g} = \hat{f}\,\hat{g}.$$

Suppose that f(α) and g(α) are the densities of independent
random variables A and B. Their convolution is the
density of the sum A + B. Let T be the waiting time for
the first blip in the Poisson process with random intensity
A + B. We can compute P(T > t) in two ways. Since the
density of A + B is f*g, we know that $P(T > t) = \widehat{f * g}(t)$.
On the other hand, we may view the event (T > t) in
another way. Sprinkle blips on [0,∞) with intensity A
and then with intensity B. Then (T > t) = (T_A > t) ∩ (T_B > t),
where T_A is the waiting time for the first A-blip and
T_B is the waiting time for the first B-blip. Since these
two kinds of blips were sprinkled independently,
$P(T > t) = P(T_A > t)\,P(T_B > t) = \hat{f}(t)\,\hat{g}(t)$. Therefore
$\widehat{f * g} = \hat{f}\,\hat{g}$.
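
The convolution property can be checked numerically for one concrete pair of densities. The sketch below takes A and B to be exponential random variables with rates 1 and 2 (an assumption made only for this example), whose convolution has the closed form 2(e^{-x} - e^{-2x}), and approximates the integrals by Simpson's rule on a truncated range.

```python
from math import exp

def laplace(g, t, upper=60.0, steps=60000):
    """Composite Simpson approximation to the integral of exp(-x t) g(x) on [0, upper].

    The infinite range is truncated at `upper`, which is harmless here
    because the integrands decay exponentially. `steps` must be even.
    """
    h = upper / steps
    total = g(0.0) + exp(-upper * t) * g(upper)
    for i in range(1, steps):
        x = i * h
        total += (4 if i % 2 else 2) * exp(-x * t) * g(x)
    return total * h / 3

def f(x):
    """Assumed density of A: exponential with rate 1."""
    return exp(-x)

def g(x):
    """Assumed density of B: exponential with rate 2."""
    return 2 * exp(-2 * x)

def conv(x):
    """Density of A + B for the two densities above."""
    return 2 * (exp(-x) - exp(-2 * x))

t = 0.5
print(laplace(conv, t), laplace(f, t) * laplace(g, t))
```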

Yoga. Randomizing by an integer or a continuous random
variable results in a "generating function" or a "transform" of
the distribution or density, respectively.

All transforms can be given a probabilistic (possibly
quantum probabilistic) interpretation. The Fourier
transform is perhaps the deepest example of a transform, since
it is intimately connected with quantum mechanics.

Finite Sampling Processes 

The finite sampling process or balls-into-boxes can 
also be randomized. Namely we choose a random number K of 
balls and then place them randomly into the n boxes. If 
we can make a judicious choice of distribution for K, then 
we can possibly make computations in the finite sampling process 
easier. It was Schrödinger's observation that a good choice
for the distribution of K is the Poisson distribution.

The reason that the Poisson distribution works so well
is the fact about the Poisson process noted earlier: if we
combine two independent Poisson processes with intensities
α and β, the result is a Poisson process with intensity
α + β. In terms of the Poisson distribution this says that
if X and Y are independent Poisson random variables of
parameters λ and μ, then X + Y has Poisson distribution
with parameter λ + μ. We simply reverse this. Suppose
that K has Poisson distribution with parameter α. Then
K = K_1 + K_2 + ... + K_n, where the K_i are independent
Poisson random variables each with parameter α/n. The
randomized finite sampling process then "splits up" into
n independent randomized finite sampling processes. Each of
these randomized finite processes consists of placing a
random number of balls into one box.

[Figure: ordinary finite process (K balls, n boxes) -> randomize -> randomized finite process (random K, n boxes) -> split up -> n independent randomized finite processes (K_1, K_2, ..., K_n)]

In the last section we used a more specific model. There K
was N(1), the number of points occurring in [0,1) in the
Poisson process of intensity α. Then

$$N(1) = N_1\left(\tfrac{1}{n}\right) + N_2\left(\tfrac{1}{n}\right) + \cdots + N_n\left(\tfrac{1}{n}\right)$$

is a sum of n independent Poisson random variables each of
parameter α/n, where N_i(1/n) is the number of points occurring
in [(i-1)/n, i/n), i.e. N_i(1/n) = N(i/n) - N((i-1)/n).

By randomizing, we made non-independent random variables 
(the occupation numbers) independent. We return to the 
non-randomized process by conditioning the randomized one, 
using the law of alternatives. 
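
The claim that randomizing makes the occupation numbers independent can be checked directly. The sketch below computes the joint probability of given occupation numbers in two ways: by the law of alternatives (Poisson total, then a multinomial placement into n equally likely boxes) and as a product of Poisson probabilities of parameter α/n. The function names and the sample values are illustrative assumptions.

```python
from math import exp, factorial, prod

def poisson_pmf(k, lam):
    """Poisson probability with parameter lam."""
    return lam**k / factorial(k) * exp(-lam)

def joint_by_alternatives(counts, alpha):
    """P(K_1 = c_1, ..., K_n = c_n): choose K ~ Poisson(alpha), then place K balls."""
    n = len(counts)
    total = sum(counts)
    multinomial = factorial(total)
    for c in counts:
        multinomial //= factorial(c)  # exact at each step
    return poisson_pmf(total, alpha) * multinomial / n**total

def product_of_marginals(counts, alpha):
    """Product of independent Poisson(alpha / n) probabilities."""
    n = len(counts)
    return prod(poisson_pmf(c, alpha / n) for c in counts)

counts = (2, 0, 1)  # hypothetical occupation numbers for n = 3 boxes
print(joint_by_alternatives(counts, 1.7), product_of_marginals(counts, 1.7))
```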

A generalization of this process immediately comes to
mind. We could just as easily drop balls into boxes of
different sizes. That is, such that the balls are not equally
likely to fall into the various boxes. The Schrödinger
technique works just as well in this case; the only change
required is that K be split into a sum K_1 + K_2 + ... + K_n, where
K_i is Poisson with parameter p_i α, p_i being the probability
that any given ball falls in box i. This is the
physicists' model of a classical statistical mechanical
system. Here p_i is related to the energy of the state
represented by box i.

6. Reliability Theory

Distribution   Type         Parameter(s)     Model(s)

Exponential    continuous   $\alpha$         $W_1$ or any $T_i$ in the Poisson process
Gamma          continuous   $\alpha$, k      $W_k$ in the Poisson process
Poisson        integer      $\lambda$        $N(t)$ in the Poisson process, where $\lambda = \alpha t$


Distribution   Distribution or Density                                      Mean         Variance

Exponential    $f(t) = \alpha e^{-\alpha t}$                                $1/\alpha$   $1/\alpha^2$
Gamma          $f(t) = \alpha^k t^{k-1} e^{-\alpha t} / (k-1)!$             $k/\alpha$   $k/\alpha^2$
Poisson        $p_k = \lambda^k e^{-\lambda} / k!$                          $\lambda$    $\lambda$


Table of Poisson Distributions 


Fact. If N and M are independent Poisson random variables
whose parameters are $\lambda$ and $\mu$, respectively, then N + M
is also Poisson but with parameter $\lambda + \mu$.
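
The fact about sums of independent Poisson random variables can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the parameter values 1.5 and 2.5 are illustrative assumptions. It computes P(N + M = k) by summing over the ways the total k can be split between N and M, and compares the result with the Poisson probabilities for the summed parameter.

```python
import math

def poisson_pmf(lam, k):
    """P(N = k) for a Poisson random variable with parameter lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def pmf_of_sum(lam, mu, k):
    """P(N + M = k): sum over N = j, M = k - j, using independence."""
    return sum(poisson_pmf(lam, j) * poisson_pmf(mu, k - j) for j in range(k + 1))

lam, mu = 1.5, 2.5  # illustrative parameters, not taken from the text
for k in range(6):
    # The two columns should agree up to rounding error.
    print(k, pmf_of_sum(lam, mu, k), poisson_pmf(lam + mu, k))
```

The agreement is a consequence of the binomial theorem applied to $(\lambda + \mu)^k$.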

              Bernoulli                      Poisson                        Uniform

Parameters    p = bias                       $\alpha$ = intensity, or       n = no. of points sampled,
                                             average number of blips        a = length of interval,
                                             per unit interval              n/a = intensity

Outcomes      $X_i$ = outcome of             (none)                         $X_i$ = $i$th point sampled;
              $i$th toss;                                                   independent,
              independent,                                                  equidistributed;
              equidistributed;                                              Uniform distribution
              Binomial distribution

Counts        $S_n$ = no. of successes       $N(t)$ = no. of blips          $U(t)$ = no. of points
              in first n tosses;             in $[0,t)$;                    in $[0,t)$;
              Binomial distribution          Poisson distribution           Binomial distribution

Waiting       $W_k$ = $k$th waiting time;    $W_k$ = $k$th waiting time;    $X_{(k)}$ = $k$th order statistic;
times         Negative binomial              Gamma distribution             Dirichlet distribution
              distribution

Gaps          $T_i$ = $i$th gap;             $T_i$ = $i$th gap;             $L_i$ = $i$th gap;
              independent,                   independent,                   not independent,
              equidistributed;               equidistributed;               exchangeable;
              Geometric distribution         Exponential distribution       Dirichlet distribution


Table of Analogies: Bernoulli, Poisson and Uniform Process 

7. Exercises for Chapter VI. The Poisson Process

1. You are the captain of the Bicentennial Eagle , a spaceship 
that has just returned from hyperspace to ordinary space, only 
to encounter the debris of a recently destroyed planet. The 
debris consists of essentially spherical rocks 20m in radius. 
The destroyed planet was originally the same size as the earth, 
and its debris is now uniformly scattered throughout a region 
10 km in radius. Your maneuvering jets are temporarily out of
order. If you are headed directly toward the center of the 
debris, what are your chances of getting all the way through 
the debris without a collision? Assume that your ship has a 
circular cross-section of radius 10m. Explain any assumptions 
you may be making. 

2. At $5 \times 10^4$ km/hour how long would you have in exercise 1
to repair your maneuvering jets before your chances of a collision
reach 10%? Explain precisely what you are computing in this
problem.

3. A beam of protons is accelerated to high energy and is deflected 
so that it encounters a pool of liquid hydrogen. The tracks of the 
protons in the beam are visible in this detector, and one can 
easily see where a proton in the beam collides with a proton in 

the pool of liquid hydrogen. Describe how far a given proton 
travels before it collides with a proton in the pool. 

4. There exist enzymes that attack only a certain nucleotide
sequence in a chromosome. Describe a means of testing whether or
not a given nucleotide sequence appears randomly in a given chromosome.

5. In the birthday coincidence problem (exercise II.9), the paradox
comes from thinking that one is looking for another person with the 
same birthday as your birthday. Compute the distribution of the 
number of persons chosen at random you must ask until you find one 
with the same birthday as yours. What kind of distribution is it? 
Find an exponential distribution that approximates it. Compute 

the average value of this random variable as well as the number of 
persons one must ask in order to have a 50% chance of finding one 
with your birthday. 

6. How does the answer to exercise 5 change if we include February 
29th as a possible birthdate? 

7. Roughly speaking, the relationship between the birthday problem
in exercise 5 and the birthday coincidence problem in exercise II.9
is that in either case we have a certain number of pairs of persons
from which we look for a birthday coincidence, but that in the former
problem we consider a collection of pairs all of which have one given
person in common whereas in the latter we consider all pairs from a
set of persons. For example we saw that in a class of 23 students
there is about a 50% chance of a birthday coincidence. Such a class
has $\binom{23}{2} = \frac{23 \cdot 22}{2} = 253$ pairs of students. Compare this with
the last part of your answer to exercise 5.

8. In a large class the students call out their birthdays
until someone in the class finds that his or her birthday has
been called. Technically this is not a random variable since it is
possible that no pair of students have the same birthday. However,
if we assume that a match will eventually be found, then it is a
random variable which is approximately exponentially distributed.
Find the parameter for this exponential distribution, and compute
its mean. Compare with exercise III.25.

9. A sociobiologist wishes to test whether or not birds of a certain 
species practice territorial spacing of their nest locations. Com- 
pute the distribution of the distance of a given nest from its nearest 
neighbor. Use this to formulate a statistical test. 

10. Let $Y_1, Y_2, \ldots, Y_n$ be n independent, exponentially distributed
random variables, each with parameter $\alpha$. Compute the order statistics
$Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$ of these random variables. One can do this
in two ways. Either change variables and convert to a Uniform process
(see section V.8) or use a modification of the reasoning used
in exercise III.53 which was made rigorous in section V.7 ("needles
on a stick problem").

11. Compute the expectations $E(Y_{(k)})$ in exercise 10 above. Compare
with the expectations of the order statistics of the gaps in the
Uniform process (exercise III.53).

12. (Feller) Three persons A, B and C arrive at a post office
simultaneously. There are two counters, and these are taken immediately
by A and B. Assume the service time of a given individual is exponentially
distributed with parameter $\alpha$. Assume also that different

Chapter VII Entropy and Information

That probability is closely connected with information
should come as no surprise after problems such as the
jailer paradox exercise. What entropy does is to make this
connection precise. In section 1 we discuss entropy for
finite-valued random variables. In the next section we give
a dramatic application of the law of large numbers
to information theory: the Shannon Coding Theorem. Finally
in section 3 we turn to the case of continuous random variables
and prove that essentially all the interesting distributions
we have seen in probability theory may be defined by entropy
considerations.

1. Discrete Entropy

We will start by defining entropy for integer random 
variables taking only finitely many values. Later in a 
step-by-step procedure, we will extend the concept to continuous 
random variables. 

Partitions

A random variable is said to be a finite-valued random
variable if it takes finitely many values. For example, $S_n$
in the Bernoulli process is finite-valued since it can only
take on values from 0 to n.

If X is a finite-valued random variable whose values are
1, 2, 3, ..., n, then X determines the events (X=1), (X=2), ..., (X=n).
Moreover, every outcome of $\Omega$ is in exactly one of these events.
We call this situation a partition of $\Omega$.

In general, a partition $\pi$ of $\Omega$ is a collection of nonempty
events $B_1, B_2, \ldots, B_n$ called the blocks of $\pi$ such that:

(a) no two blocks intersect,

(b) every sample point is in some block, i.e.
$\bigcup_i B_i = \Omega$.

The only difference between a random variable X and a partition
$\pi$ is that a random variable consists not only of a
partition but also of a label (the value X takes on that block)
for each block. The partition $\pi(X)$ defined by X is the partition
whose blocks are (X=1), (X=2), ..., (X=n), i.e. $\pi(X)$ is
obtained by ignoring the particular labels that X attaches to
the events it defines.

More generally suppose that we have a number of finite-valued
random variables $X_1, \ldots, X_r$. The smallest events that
one can define by these random variables are the events

$$(X_1 = i_1) \cap (X_2 = i_2) \cap \cdots \cap (X_r = i_r),$$

and any event definable by the random variables $X_1, \ldots, X_r$
is necessarily a union of some of the above events. The
partition whose blocks are the above events is called the
(joint) partition $\pi(X_1, \ldots, X_r)$ defined by $X_1, \ldots, X_r$. The
partition $\pi(X_1, \ldots, X_r)$ is related to the partitions
$\pi(X_1), \pi(X_2), \ldots, \pi(X_r)$ by means of the operation on partitions
called the meet. In general if $\sigma$ and $\tau$ are two partitions,
whose blocks are $C_1, C_2, \ldots$ and $D_1, D_2, \ldots$, respectively,
then the meet of $\sigma$ and $\tau$, written $\sigma \wedge \tau$, is the partition whose
blocks are $C_i \cap D_j$ whenever they are nonempty. In terms of
the meet, $\pi(X_1, X_2, \ldots, X_r) = \pi(X_1) \wedge \pi(X_2) \wedge \cdots \wedge \pi(X_r)$.
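
The definition of the meet translates directly into a computation on sets. The following is a small sketch, not from the original text: the function name `meet`, the representation of blocks as frozensets, and the die-rolling example are all illustrative assumptions.

```python
def meet(sigma, tau):
    """Blocks of the meet: all nonempty intersections of a block of sigma with a block of tau."""
    return [c & d for c in sigma for d in tau if c & d]

# Hypothetical example: one roll of a die.
omega = set(range(1, 7))
parity = [frozenset({1, 3, 5}), frozenset({2, 4, 6})]   # odd or even
half = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]     # low or high

joint = meet(parity, half)
print(sorted(sorted(block) for block in joint))
```

Each block of the result records both the parity and the half, which is exactly the information carried by the joint partition.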

As the joint distribution of random variables determines
everything about their "correlation" so the joint partition
of a set of partitions determines their correlation. In particular,
it is easy to see that independence of random variables
is really a property of the partitions defined by them. Let $\sigma$
and $\tau$ be two partitions. We say $\sigma$ and $\tau$ are independent if and
only if

$$P(C \cap D) = P(C)P(D)$$

for all blocks C of $\sigma$ and D of $\tau$. When $\sigma$ and $\tau$ are independent
we can display the sample space $\Omega$ as a "checkerboard"

[Figure: checkerboard display of $\Omega$, marking a typical block of $\tau$ and a typical block of $\sigma$.]

whose rows are blocks of $\tau$, whose columns are blocks of $\sigma$, and such
that the "area" is proportional to the probability.

The meet is the analog for partitions of the intersection
of sets. There is a whole algebra of partitions
analogous to that for sets. For example, there is an analog
of set union called the join of partitions and written $\sigma \vee \tau$.
We leave it as an exercise to decide how this ought to be
defined. We will not have need of this particular operation.

Another notion from sets is that of subset, and its analog
for partitions will be very important for us. We say that a
partition $\sigma$ with blocks $C_1, C_2, \ldots$ is finer than a partition
$\tau$ with blocks $D_1, D_2, \ldots, D_m$ if every block $C_i$ of $\sigma$ is contained
in some block $D_j$ of $\tau$. We write $\sigma \le \tau$ for this relation. If X
and Y are finite-valued random variables, then $\pi(X) \le \pi(Y)$ means
that an observation of X is sufficient to determine anything
one might ask about Y. The technical term for this relation
is that X is a sufficient statistic for Y. More generally,
if $X_1, X_2, \ldots, X_n$ are a collection of finite-valued random
variables such that $\pi(X_1, \ldots, X_n)$ is finer than $\pi(Y)$, we say
that $X_1, X_2, \ldots, X_n$ is sufficient for Y. In practice one often
finds that in a particular experiment one wants the value of
Y but that the random variables one actually measures form a
sequence $X_1, X_2, \ldots$. If for some n, $X_1, X_2, \ldots, X_n$ is sufficient
for Y, then one can in principle compute Y from the
measurements of the X's. One also says that $X_1, \ldots, X_n$
code for Y.

Entropy

The reason for introducing partitions is that the "information
content" of a finite-valued random variable X is a
property of the collection of events defined by X and not of
the particular labels X happens to assign to these events.
We now make this precise. The entropy of a partition $\pi$ whose
blocks are the events $B_1, B_2, \ldots, B_n$ is defined by

$$H_2(\pi) = \sum_i P(B_i) \log_2\left(\frac{1}{P(B_i)}\right),$$

where by convention $0 \cdot \log_2(\frac{1}{0})$ is defined to be 0. The entropy
of a finite-valued random variable X is the entropy of its
partition:

$$H_2(X) = H_2(\pi(X)).$$



We remark that $\log_2$ could be replaced by $\log_b$ for any
base $b>1$. The only effect on $H_2(\pi)$ is to multiply by the
scale factor $\log_b(2)$, i.e. we merely alter the units in which
the entropy is measured. The use of $\log_2$ is traditional. In
this case we say that $H_2(\pi)$ is measured in bits. More generally,
we will write $H_b(\pi)$ for $\sum_i P(B_i) \log_b\left(\frac{1}{P(B_i)}\right)$. If we
use $H(\pi)$ without a subscript we mean that the base b should
be taken to be e, the base of the natural logarithms. We
say that $H(\pi)$ is measured in nats (natural digits).

Consider the example of tossing a biased coin with bias
p, i.e. consider a partition consisting of one or two blocks.
If p is 1, then we know for certain that the coin will always
show heads. In this case $H_2(X) = 0 \cdot \log_2(\frac{1}{0}) + 1 \cdot \log_2(\frac{1}{1}) = 0$.
Entropy zero corresponds to total certainty. Now suppose
that p is somewhat less than 1. The toss is now somewhat less
predictable, and we find that the entropy is a small positive
number. As p decreases, the entropy gradually increases,
reaching a maximum when p = 1/2. For a fair coin
$H_2(X) = \frac{1}{2}\log_2(2) + \frac{1}{2}\log_2(2) = \frac{1}{2} + \frac{1}{2} = 1$ bit. Finally, as
p decreases from 1/2 to 0, the entropy again decreases to
zero; for now the toss is becoming increasingly predictable.
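
The behavior of the entropy of a biased coin can be tabulated directly from the definition. This is a minimal sketch, not part of the original text; the function name `entropy_bits` and the particular values of p are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """H_2 = sum of p * log2(1/p), using the convention 0 * log2(1/0) = 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Entropy of one toss of a coin with bias p, for a few illustrative biases.
for p in (1.0, 0.9, 0.75, 0.5, 0.25, 0.1, 0.0):
    print(p, round(entropy_bits([p, 1 - p]), 4))
```

The printed values rise from 0 at p = 1 to 1 bit at p = 1/2 and fall back to 0 at p = 0, symmetrically about p = 1/2.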




More generally suppose that $\pi$ has n blocks. Shannon
proved that $H_2(\pi)$ takes its maximum value precisely when all
n outcomes are equally likely. In this case the entropy is
$H_2(\pi) = \log_2(n)$ bits or $H(\pi) = \ln(n)$ nats. We will now prove
this. All of our later characterizations of distributions
having maximum entropy rely on the same basic technique we
will use in this case. The key fact is this inequality:

Basic logarithmic inequality. For every $u > 0$, $\ln(u) \le u - 1$, with
equality if and only if $u = 1$.

This fact is easy to prove using Calculus: $f(u) = \ln(u) - u + 1$
has derivative $f'(u) = \frac{1}{u} - 1$, so $f'(u) > 0$ for $u < 1$ and $f'(u) < 0$
for $u > 1$, i.e. $f(u)$ takes its maximum value $f(1) = 0$ at $u = 1$.

Now compare $H(\pi)$ to $\ln(n)$ using the above inequality:

$$
\begin{aligned}
H(\pi) - \ln(n) &= \sum_{i=1}^{n} P(B_i) \ln\left(\frac{1}{P(B_i)}\right) - \ln(n) \\
&= \sum_{i=1}^{n} P(B_i) \ln\left(\frac{1}{P(B_i)\,n}\right) \\
&\le \sum_{i=1}^{n} P(B_i) \left(\frac{1}{P(B_i)\,n} - 1\right) \\
&= \sum_{i=1}^{n} \frac{1}{n} - \sum_{i=1}^{n} P(B_i) \\
&= 1 - 1 = 0.
\end{aligned}
$$



The fact that the probabilities of the n blocks add up to 1,
$\sum_{i=1}^{n} P(B_i) = 1$, is used twice above: to absorb $\ln(n)$ into the
sum, and in the second-to-last equality. In any case we find that if $\pi$ has n
blocks then $H(\pi) \le \ln(n)$.

When is equality possible? In our derivation of
$H(\pi) \le \ln(n)$, equality can fail in only one of the steps:

$$\ln\left(\frac{1}{P(B_i)\,n}\right) \le \frac{1}{P(B_i)\,n} - 1 \quad \text{for all } i.$$

Now the basic logarithmic
inequality tells us that this will be an equality if and only
if $\frac{1}{P(B_i)\,n} = 1$ for all blocks, i.e. $P(B_i) = \frac{1}{n}$ for all i. This
completes our proof.
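
The bound just proved, that a partition with n blocks has entropy at most $\ln(n)$ nats, with equality for equally likely blocks, can be spot-checked numerically. The sketch below is not part of the original text; the function name, the choice n = 5, the seed and the number of trials are illustrative assumptions, and a numerical check of this kind is of course no substitute for the proof.

```python
import math
import random

def entropy_nats(probs):
    """H = sum of p * ln(1/p), using the convention 0 * ln(1/0) = 0."""
    return sum(p * math.log(1 / p) for p in probs if p > 0)

random.seed(0)  # fixed seed so the run is reproducible
n = 5
for _ in range(1000):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    probs = [w / total for w in weights]      # a random set of n probabilities
    # Small tolerance for floating-point rounding.
    assert entropy_nats(probs) <= math.log(n) + 1e-12

# The uniform case attains the bound.
print(entropy_nats([1 / n] * n), math.log(n))
```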

When X has a partition $\pi$ all of whose blocks have the
same probability, we say that X is completely random or
totally random, although this is not quite the best terminology.
One should really say that X has maximum uncertainty. (Equivalently,
the measurement of X gives one the maximum information
about the outcome of an experiment, of any random variable
having the same number of outcomes.) It is unfortunate
that it has become standard terminology to describe such
random variables as being simply "random". For example,
one often says "choose a card at random" rather than "choose
a card completely at random", as if there were no other way
to choose a card from a deck. In fact most "random" shuffles
of a deck are far from being completely random (see exercise 4);
as a result, choosing a card or dealing a hand is not totally
random and the probabilities computed in exercise III.4 would
seldom be achieved in an actual game. On the other hand, the
terminology suggests that they are. This is the price one
pays for using a vague, imprecise language to describe
probabilistic concepts.

Properties of Entropy

So far we have discussed examples of the entropy of some
random variables. Although these examples provide some motivation
for our definition of entropy they leave unanswered
the basic question of why this formula and not some other is
the one we use to define entropy. We will now consider why
our formula is the only possible one. We will do this by
finding five self-evident properties that ought to hold for
any reasonable measure of information (or entropy). It then
turns out that our definition of entropy is the only one that
satisfies all these properties.

We begin with the most obvious of properties. As we have
defined it, H is a function of partitions of the sample space. 
However, it should be clear that we want H to depend only on 
the set of probabilities of the blocks of the partition. In 
fact, we want H to depend only on the positive probabilities 
which occur. Moreover, we want H to be a continuous function 
of these probabilities. This is a convenience only. We 
could, with a great deal of effort, derive continuity from 
other more complex conditions; but we would rather concentrate 
on the important issues. We summarize the conditions on H we 
have just described before going on to the difficult question 
of conditional entropy. 

Entropy property 1. An entropy is a function H defined on
sets $\{p_1, p_2, \ldots, p_n\}$ of nonnegative real numbers which
satisfy $p_1 + p_2 + \cdots + p_n = 1$.

Entropy property 2. If H is an entropy function, then for any
set $\{p_1, p_2, \ldots, p_n\}$ on which H is defined, H satisfies:

$$H(p_1, p_2, \ldots, p_n, 0) = H(p_1, p_2, \ldots, p_n).$$

In other words, H depends only on the nonzero $p_i$'s in a given set.

Entropy property 3. An entropy function is continuous.

There are two ways to think of the concept of conditional
entropy, and the fact that they are equivalent is our next
property of entropy. To illustrate the ideas involved, we
consider the following simple weighing problem. We have
three coins, some of which may be counterfeit (but not all).
Counterfeit coins are distinguishable from normal coins by
the fact that they are lighter. We are given a balance scale,
and we wish to find out which, if any, of the coins are
counterfeit. The sample space for this problem consists
of seven sample points, one for each possible set of good
coins. We denote them as follows:

$$\Omega = \{1, 2, 3, 12, 13, 23, 123\}.$$

Now what happens when we put the first two coins on the scale,
one on each side? The sample space is partitioned into
three blocks corresponding to the three possible outcomes
of the weighing: $\sigma = \{\{12, 123, 3\}, \{2, 23\}, \{1, 13\}\}$. After
recording the result of this weighing, we then place the
second and third coins on the two sides of the scale. The
result of this second weighing is to partition each of the
blocks of the first weighing:

$\{12, 123, 3\}$ becomes $\{12\}, \{123\}, \{3\}$
$\{2, 23\}$ becomes $\{2\}, \{23\}$
$\{1, 13\}$ becomes $\{1\}, \{13\}$.
The combined information of the two weighings is represented
by the partition into seven blocks, each with one sample point.
Call this partition $\pi$. Conditional entropy is concerned with
the effect of the second weighing, given that the first has
occurred. One way to analyze this is to look at each block
$\sigma_i$ of the partition of the first weighing and to analyze the
situation as if $\sigma_i$ were the whole sample space. In general,
for an event A and a partition $\tau$ we define the conditional
entropy of $\tau$ given A, written $H(\tau|A)$, to be the entropy of
the partition $\tau_1 \cap A, \tau_2 \cap A, \ldots$ that $\tau$ induces on A, where A
is treated as the whole sample space. Thus
in the above weighing problem we have three conditional
entropies, one for each possible outcome of the first weighing:
$H(\pi|\sigma_1)$, $H(\pi|\sigma_2)$ and $H(\pi|\sigma_3)$. The conditional entropy
of $\pi$ given $\sigma$ is then defined to be the average of these.
More precisely, if $\pi$ and $\sigma$ are any two partitions of a
sample space $\Omega$ such that $\pi$ is finer than $\sigma$, we define the
conditional entropy of $\pi$ given $\sigma$ to be the average value of
$H(\pi|\sigma_i)$ over all blocks $\sigma_i$ of $\sigma$:

$$H(\pi|\sigma) = \sum_i P(\sigma_i) H(\pi|\sigma_i).$$

On the other hand, we would like to think of information
as a "quantity" that increases as we ask more and more questions
about our experiment. Therefore, the conditional entropy of $\pi$
given $\sigma$ ought to be the net increase in entropy from $\sigma$ to $\pi$.
In other words, we require our entropy function to satisfy:

Entropy property 4. If $\pi$ is a finer partition than $\sigma$, then

$$H(\pi|\sigma) = H(\pi) - H(\sigma).$$
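
The two views of conditional entropy can be compared on the weighing problem. The sketch below is not part of the original text. It assumes, purely for illustration, that the seven sample points are equally likely (the text does not assign probabilities), and it uses the base-2 entropy formula from earlier in the chapter; all function names are hypothetical.

```python
import math
from fractions import Fraction

def entropy_bits(probs):
    """H_2 = sum of p * log2(1/p) over the positive probabilities."""
    return sum(float(p) * math.log2(1 / float(p)) for p in probs if p > 0)

# Assumption (not in the text): the seven sample points are equally likely.
points = ["1", "2", "3", "12", "13", "23", "123"]
prob = {s: Fraction(1, 7) for s in points}

sigma = [{"12", "123", "3"}, {"2", "23"}, {"1", "13"}]   # first weighing
pi = [{s} for s in points]                                # both weighings

def P(event):
    return sum(prob[s] for s in event)

def H(partition):
    return entropy_bits([P(block) for block in partition])

def H_given_event(partition, A):
    """Entropy of the partition induced on A, with A treated as the whole sample space."""
    return entropy_bits([P(block & A) / P(A) for block in partition if block & A])

def H_given(partition, coarser):
    """Average of H(partition | block) over the blocks of the coarser partition."""
    return sum(float(P(A)) * H_given_event(partition, A) for A in coarser)

# The averaged conditional entropy and the net increase in entropy should agree.
print(H_given(pi, sigma), H(pi) - H(sigma))
```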

The last property we require is one that we have already
discussed. The partition having maximum entropy among all
partitions with a given number of blocks is the one for
which all the blocks have the same probability.

Entropy property 5. If H is an entropy function, then for
any set $\{p_1, p_2, \ldots, p_n\}$ on which H is defined, H satisfies:

$$H(p_1, p_2, \ldots, p_n) \le H(\tfrac{1}{n}, \tfrac{1}{n}, \ldots, \tfrac{1}{n}).$$

We are now ready for the following remarkable fact: if
H satisfies the above five properties, then H is given by the
formula introduced earlier in this chapter, except for a
possible scale change.

Uniqueness of Entropy. If H is a function satisfying the
five properties of an entropy function, then there is a
constant C such that H is given by:

$$H(p_1, p_2, \ldots, p_n) = C \sum_i p_i \log_2(p_i).$$

Proof

The proof is rather technical, so we suggest omitting
it on the first reading, returning to it later. We first
apply property 4 to the partition consisting of just one
block: $\Omega$ itself. By definition $H(\Omega|\Omega)$ is the same as $H(\Omega)$.
Therefore, $H(\Omega) = H(\Omega) - H(\Omega) = 0$.

We now define a function f(n) by $f(n) = H(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n})$. We have
just shown that f(1) = 0, and we want to calculate f(n) in
general. Using properties 2 and 5, we show that f(n) is
increasing:

$$f(n) = H(\tfrac{1}{n}, \ldots, \tfrac{1}{n}) = H(\tfrac{1}{n}, \ldots, \tfrac{1}{n}, 0) \le H(\tfrac{1}{n+1}, \ldots, \tfrac{1}{n+1}) = f(n+1).$$

Next we consider a partition $\sigma$ consisting of $n^{k-1}$ blocks
each of which has probability $\frac{1}{n^{k-1}}$. Then subdivide each of
these into n parts, each of which has the same probability;
call the resulting partition $\pi$. The conditional entropy
$H(\pi|\sigma_i)$ for each block $\sigma_i$ is clearly given by f(n). Thus
the conditional entropy $H(\pi|\sigma)$ is f(n). By property 4,
$f(n) = H(\pi|\sigma) = H(\pi) - H(\sigma) = f(n^k) - f(n^{k-1})$. Therefore,
if we apply this fact k times, we obtain: $f(n^k) = k f(n)$.

Now fix two positive integers n and k. Since the
exponential function is an increasing function, there is
an integer b such that: $2^b \le n^k < 2^{b+1}$. We now apply the
two facts about f(n) obtained above to this relation:

$$f(2^b) \le f(n^k) \le f(2^{b+1}) \quad \text{(f is increasing)}$$

$$b\,f(2) \le k\,f(n) \le (b+1)\,f(2)$$

Now divide these inequalities by $k f(2)$:

$$\frac{b}{k} \le \frac{f(n)}{f(2)} \le \frac{b+1}{k}.$$

Now apply the increasing function $\log_2$ to the inequalities
$2^b \le n^k < 2^{b+1}$. This gives that $b \le k \log_2(n) < b+1$. If we
divide these by k we obtain:

$$\frac{b}{k} \le \log_2(n) < \frac{b+1}{k}.$$

It follows that both $f(n)/f(2)$ and $\log_2(n)$ are in the interval
$[\frac{b}{k}, \frac{b+1}{k}]$. This implies that $f(n)/f(2)$ and $\log_2(n)$ can be no
farther apart than the length of this interval, $\frac{1}{k}$. But n and
k were arbitrary positive integers. So if we let k get very
large, we are forced to conclude that $f(n)/f(2)$ coincides with
$\log_2(n)$. Thus for positive integers n, we have:

$$f(n) = f(2) \log_2(n).$$

We will define the constant C to be $-f(2)$. Since $f(2) > f(1) = 0$,
we know that C is negative.

We next consider a set $\{p_1, p_2, \ldots, p_n\}$ of positive
rational numbers such that $p_1 + p_2 + \cdots + p_n = 1$. Let N be their
common denominator, i.e., $p_i = a_i/N$ for all i, where each $a_i$
is an integer and $a_1 + a_2 + \cdots + a_n = N$. Let $\sigma$ be a partition
corresponding to the set of probabilities $\{p_1, p_2, \ldots, p_n\}$.
Let $\pi$ be a partition obtained by breaking up the $i$th block
of $\sigma$ into $a_i$ parts of equal probability. Then every block of $\pi$ has probability
1/N. By definition of conditional entropy, $H(\pi|\sigma_i) = f(a_i)$
and

$$H(\pi|\sigma) = \sum_i p_i H(\pi|\sigma_i) = \sum_i p_i f(a_i) = -C \sum_i p_i \log_2(a_i).$$

By property 4, on the other hand, we have:

$$H(\pi|\sigma) = H(\pi) - H(\sigma) = f(N) - H(\sigma) = -C \log_2(N) - H(\sigma).$$

Combining the two expressions for $H(\pi|\sigma)$ gives us:

$$
\begin{aligned}
H(\sigma) &= -C \log_2(N) + C \sum_i p_i \log_2(a_i) \\
&= C \left[ -\sum_i p_i \log_2(N) + \sum_i p_i \log_2(a_i) \right] \\
&= C \left[ \sum_i p_i \left( \log_2(a_i) - \log_2(N) \right) \right] \\
&= C \left[ \sum_i p_i \log_2(a_i/N) \right] \\
&= C \sum_i p_i \log_2(p_i).
\end{aligned}
$$

By continuity (property 3), H must have this same formula
for all sets $\{p_1, p_2, \ldots, p_n\}$ on which it is defined. This
completes the proof.

We leave it as an exercise to show that the above formula
for entropy actually satisfies the five postulated properties.
We conclude by giving an interpretation of independence of
partitions in terms of conditional entropy. Intuitively if
$\pi$ and $\sigma$ are independent then their joint entropy $H(\pi \wedge \sigma)$
is the sum of the individual entropies: $H(\pi) + H(\sigma)$. In
terms of conditional entropy, this says that $H(\pi \wedge \sigma | \sigma) = H(\pi)$.



2. The Shannon Coding Theorem

A consequence of Entropy property 4 of the last section
is that if we wish to answer a question X by means of a sequence
of questions $S_1, S_2, \ldots, S_n$, the joint entropy of
$S_1, S_2, \ldots, S_n$ must be at least as large as the entropy of X,
and hence the sum of the entropies of the $S_i$'s must be at
least as large as the entropy of X. In particular, if the
$S_i$'s are yes-no questions, then $H_2(S_i) \le 1$ and we get the crude
inequality $n \ge H_2(X)$. The problem of finding a set of sufficient
statistics for a random variable X is called the
coding problem for X, and the sequence $S_1, S_2, \ldots, S_n$ is said
to code X. As we will see in the exercises, the kinds of
questions one may ask are usually restricted to some class
of questions. Devising particular codes is a highly nontrivial
task.

One of the reasons that coding is so nontrivial in general
is that one is usually required to answer a whole sequence
of questions $X_1, X_2, \ldots$ produced by some process, and as a
result one would like to answer the questions in the most
efficient way possible. Consider one example. Suppose that
X takes value 0 with probability 0.85 and takes values 1
through 200 each with probability 7.5 x 10 . Then $H_2(X)$
is less than 1. Simply by counting one can see that at
least 8 yes-no questions will be needed to achieve a sufficient
statistic for X, even though the entropy suggests
that one should be able to determine X with a single yes-no
question.

Shannon's Theorem states that for any finite-valued
random variable X, it is possible to encode efficiently a
sequence of independent copies of X provided that:

(1) one encodes a block $X_1, X_2, \ldots, X_n$ all at one time,

(2) one is willing to accept a small probability
of error, $\epsilon > 0$, that a block is incorrectly coded,
such that $\epsilon$ can be made arbitrarily small.

Since one frequently encounters sequences of random variables
in actual practice, it is not unreasonable to encode them
in blocks. The small probability of error is also acceptable
since it can be made arbitrarily small. Consider for
example the random variable X mentioned in the preceding
paragraph. Since $H_2(X) < 1$, Shannon's Theorem says that there
is a block size n such that a sequence of n independent
copies of X, $X_1, \ldots, X_n$, can be encoded with a sequence of n
yes-no questions $S_1, \ldots, S_n$. Consider that the sequence of
$X_i$'s can take one of $201^n$ values, while the sequence of
$S_i$'s takes on at most $2^n$ possible values and you will begin
to appreciate Shannon's Theorem.

We must first make precise the idea that a sequence of
random variables "almost" codes for another sequence. Let
$X_1, \ldots, X_n$ and $S_1, \ldots, S_r$ be two sequences of random variables.
We say that $S_1, \ldots, S_r$ is almost sufficient for $X_1, \ldots, X_n$
with confidence $1-\epsilon$ if there is an event A such that

(1) $P(A) = 1 - \epsilon$

(2) $S_1|A, \ldots, S_r|A$ is sufficient for $X_1|A, \ldots, X_n|A$,
where $X_i|A$ is the random variable $X_i$ conditioned
by the occurrence of A.

Put another way, condition (2) says that the joint partition
$\pi(S_1) \wedge \cdots \wedge \pi(S_r)$ when restricted to A is finer than
$\pi(X_1) \wedge \cdots \wedge \pi(X_n)$ when restricted also to A.

Shannon's Coding Theorem. Let $X_1, X_2, \ldots$ be a sequence of
independent equidistributed finite-valued random variables
such that $H_2(X_1) = h$. For any $\epsilon > 0$ no matter how small and
any $\delta > 0$ no matter how small, there is an integer N such that
for any block size $n > N$, one can find a sequence $S_1, S_2, \ldots, S_{[hn+\delta n]}$
of $[hn+\delta n]$ random variables each taking two values, which is
almost sufficient for $X_1, X_2, \ldots, X_n$ with confidence $1-\epsilon$.

The confidence $1-\epsilon$ represents the probability that the
$S_i$'s are able to code for a particular sequence of values
of $X_1, X_2, \ldots, X_n$. The expression $[hn+\delta n]$ stands for the
smallest integer larger than $hn+\delta n$. Finally, by entropy
considerations we know that at least $[nh]$ S's will be needed
to code for $X_1, X_2, \ldots, X_n$. The additional $\delta n$ S's represent
an extra set of S's beyond those required by entropy, but
they can be chosen to be as small a fraction of the total
set of S's as we please.

Proof. One begins by defining a sequence of random variables
$Y_1, Y_2, \ldots$ by decreeing that if $X_i$ takes value m then $Y_i$ takes
value $P(X_i = m)$. For example, if $X_i$ took values 1, ..., n each
with probability $\frac{1}{n}$, then $Y_i$ would take value $\frac{1}{n}$ with probability 1.

These random variables have two properties we need.
The first is that the $Y_i$'s are independent. This is an immediate
consequence of the fact that the $X_i$'s are so. The
second fact is that the expected value of $\log_2(1/Y_i)$ is h,
the entropy of $X_i$. To see this we simply compute:

$$E(\log_2(1/Y_i)) = \sum_m \log_2\left(\frac{1}{P(X_i = m)}\right) P(X_i = m) = H_2(X_i) = h,$$

since $\log_2(1/Y_i)$ takes value $\log_2(1/P(X_i = m))$ when $X_i = m$.

The sequence $\log_2(1/Y_1), \log_2(1/Y_2), \ldots$ is a sequence
of independent equidistributed random variables each with
mean h. By the Law of Large Numbers,

$$P\left( \lim_{n\to\infty} \frac{\log_2(1/Y_1) + \log_2(1/Y_2) + \cdots + \log_2(1/Y_n)}{n} = h \right) = 1.$$

Now $\log_2(1/Y_1) + \log_2(1/Y_2) + \cdots + \log_2(1/Y_n) = \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right)$.
Therefore:

$$P\left( \lim_{n\to\infty} \frac{1}{n} \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right) = h \right) = 1.$$

This says that for n large enough the expression $\frac{1}{n} \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right)$
will be as close to h as we please with as high a probability
as we please. The probability we want is $1-\epsilon$, and we want
$\frac{1}{n} \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right)$ to be within $\delta$ of h with this probability:

$$P\left( \left| \frac{1}{n} \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right) - h \right| < \delta \right) \ge 1 - \epsilon.$$
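
The convergence of $\frac{1}{n}\log_2(1/(Y_1 Y_2 \cdots Y_n))$ to h can be observed in a simulation. The sketch below is not part of the original text: the three-valued distribution for $X_i$, the seed, the block sizes and the function name are illustrative assumptions.

```python
import math
import random

# Hypothetical distribution for each X_i, chosen only for illustration.
dist = {0: 0.5, 1: 0.25, 2: 0.25}
h = sum(p * math.log2(1 / p) for p in dist.values())   # H_2(X_i)

def sample_average(n, rng):
    """(1/n) * log2(1 / (Y_1 ... Y_n)) for one simulated block X_1, ..., X_n."""
    values = rng.choices(list(dist), weights=list(dist.values()), k=n)
    # Y_i is the probability of the value that X_i actually took.
    return sum(math.log2(1 / dist[v]) for v in values) / n

rng = random.Random(0)  # fixed seed so the run is reproducible
print("h =", h)
for n in (10, 100, 1000, 10000):
    print(n, sample_average(n, rng))
```

As n grows the printed averages should settle near h, which is the behavior the Law of Large Numbers guarantees.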



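As one might expect, the event A in the definition of a set
of almost sufficient statistics will be the above event:

$$
\begin{aligned}
A &= \left( \left| \frac{1}{n} \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right) - h \right| < \delta \right) \\
&= \left( \left| \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right) - nh \right| < n\delta \right) \\
&= \left( -n\delta < \log_2\left(\frac{1}{Y_1 Y_2 \cdots Y_n}\right) - nh < n\delta \right).
\end{aligned}
$$

Exponentiating every term in the above pair of inequalities
preserves the inequalities so

$$
\begin{aligned}
A &= \left( 2^{-n\delta} < \frac{1}{Y_1 Y_2 \cdots Y_n}\, 2^{-nh} < 2^{n\delta} \right) \\
&= \left( 2^{-n\delta + nh} < \frac{1}{Y_1 Y_2 \cdots Y_n} < 2^{n\delta + nh} \right) \\
&= \left( 2^{-nh + n\delta} > Y_1 Y_2 \cdots Y_n > 2^{-nh - n\delta} \right).
\end{aligned}
$$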

We are now ready for the crucial step in the proof. We
count how many blocks of the joint partition $\pi(X_1) \wedge \pi(X_2) \wedge \cdots \wedge \pi(X_n)$
are contained in the event A. Suppose that there are r such
blocks; call them $B_1, B_2, \ldots, B_r$. Each of these blocks is of
the form $(X_1 = i_1) \cap (X_2 = i_2) \cap \cdots \cap (X_n = i_n)$. If we sum the
probabilities of all such events we get 1:

$$\sum_{i_1, \ldots, i_n} P\left( (X_1 = i_1) \cap (X_2 = i_2) \cap \cdots \cap (X_n = i_n) \right) = 1.$$

Since the $X_i$'s were assumed to be independent, this means that

$$\sum_{i_1, \ldots, i_n} P(X_1 = i_1) P(X_2 = i_2) \cdots P(X_n = i_n) = 1.$$

Now each of the above factors is, by definition, the value
that the corresponding $Y_i$ takes:

$$\sum_{X_1 = i_1,\, X_2 = i_2,\, \ldots,\, X_n = i_n} Y_1 Y_2 \cdots Y_n = 1.$$

If we sum just over those blocks contained in A we find that

$$\sum_{j=1}^{r} Y_1 Y_2 \cdots Y_n \le 1,$$

each $Y_i$ taking the value appropriate to the block $B_j$. But
for these blocks we know that $Y_1 Y_2 \cdots Y_n > 2^{-nh - n\delta}$. Hence

$$\sum_{j=1}^{r} 2^{-nh - n\delta} < \sum_{j=1}^{r} Y_1 Y_2 \cdots Y_n \le 1.$$

Now the terms of the sum $\sum_{j=1}^{r} 2^{-nh - n\delta}$ do not depend on the
block $B_j$. So we find that

$$r \cdot 2^{-nh - n\delta} < 1$$

or that $r < 2^{nh + n\delta}$,
i.e. there are fewer than $2^{nh + n\delta}$ blocks in A.
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We are now ready to code the random variables 
XjyXj/.-.fX . Number the blocks in A in binary using [nh+n6] 
binary digits starting with 00... 01 and ending with 11... 11. 
Note that because r<2 nh+n6 , we will have at most 2 [nh+n<S] -l 
blocks in A. We assign the binary number 00... 00 to all 
blocks outside A. 

The random variable S i is defined to be the i th digit 
of the block in which the outcome occurs. By definition 

S 1' S 2' * * ' ' S [nh+n6] ' wnen restricted to A, are sufficient 
for X, ,X~,...,X when restricted to A. When all the S.'s 
take the value 0, we are unable to determine the values of 
X^ ,x 2 , . . . ,X n , but when the S^'s take any other set of values, 
we can compute all the values of X-^ , X 2 , . . . , X n . 
This completes the proof. 
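
The counting step of the proof can be checked numerically in a small case. The following is only a sketch under assumptions of our own: the $X_i$ are taken to be Bernoulli bits with bias p (0 < p < 1), and the function name `typical_blocks` and the sample values p = 0.1, n = 20, delta = 0.2 are illustrative choices, not part of the theorem.

```python
from math import comb, log2

def typical_blocks(p, n, delta):
    """Count the blocks of n independent Bernoulli(p) bits that lie in A.

    A block (X_1=i_1, ..., X_n=i_n) containing `ones` ones has probability
    y = p**ones * q**(n - ones), which is the value of Y_1*Y_2*...*Y_n on it.
    The block lies in A when 2**(-nh - n*delta) < y < 2**(-nh + n*delta).
    Returns (r, P(A), h) with h measured in bits.
    """
    q = 1.0 - p
    h = p * log2(1 / p) + q * log2(1 / q)    # entropy of a single bit
    lower = 2.0 ** (-n * h - n * delta)
    upper = 2.0 ** (-n * h + n * delta)
    r = 0
    prob_a = 0.0
    for ones in range(n + 1):
        y = p ** ones * q ** (n - ones)
        if lower < y < upper:
            blocks = comb(n, ones)           # number of blocks sharing this y
            r += blocks
            prob_a += blocks * y
    return r, prob_a, h

# Illustrative values (assumptions, chosen only to keep the example small).
r, prob_a, h = typical_blocks(p=0.1, n=20, delta=0.2)
bound = 2.0 ** (20 * h + 20 * 0.2)
print(r, bound, prob_a)   # r should stay below the bound 2**(nh + n*delta)
```

Grouping blocks by their number of ones avoids listing all $2^n$ blocks; the inequality $r < 2^{nh+n\delta}$ is the one derived above.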

The usual form in which one sees this theorem is called 
the Shannon Channel Coding Theorem. The problem here is to 
transmit information through a noisy channel. The channel 
we consider is called the Binary Symmetric Channel. Each 
bit of information one transmits through the BSC is either 
left alone or changed. The probability that it is changed 
is p, the same for all bits, and each bit is altered or not 
independently of the others. The BSC is equivalent to the
Bernoulli process, coin tossing, with bias p.

Transmission through the BSC proceeds as follows. A
message k bits long is first sent through an encoder where
it is changed into a string of n bits. This string of bits
is then transmitted through the BSC to a decoder that converts
the received n bits into a string of k bits, which we hope is
the same as the original message.

    Bob --(k bits)--> encoder --(n bits)--> BSC --(n bits)--> decoder --(k bits)--> Alice

The problem is to design the encoder and decoder
so that the probability of error per transmitted bit is smaller
than some preassigned value and so that the redundancy n/k
is as small as possible. Equivalently, we want the rate of
transmission k/n to be as high as possible.

We may think of the noise as a sequence of Bernoulli
random variables $X_1, X_2, \ldots, X_n$ that are added to the signal.
Let $h = H_2(X_i) = p\log_2(1/p) + q\log_2(1/q)$. Then the input
signal plus the noise constitute a total entropy of k+nh
bits. The decoder can ask at most n questions about the
data it receives, since it receives just n bits of data.
From these n questions it must determine both the noise and
the original signal, hence $k+nh \le n$. Put another way, the
decoder may ask just n-k questions in order to determine the
noise and eliminate it. Thus $k+nh \le n$ or $n-nh \ge k$. Hence $1-h \ge k/n$.
This says that the rate of communication through the BSC can
never be greater than 1-h. One calls 1-h the capacity of
the channel.
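
As a small illustration of the bound $k/n \le 1-h$, the capacity can be tabulated for a given bias. This is a sketch only; the function names are ours, and the convention $0\cdot\log_2(1/0) = 0$ at the endpoints p = 0 and p = 1 is an assumption of the sketch.

```python
from math import log2

def binary_entropy(p):
    """h = p*log2(1/p) + q*log2(1/q) in bits, with 0*log2(1/0) taken as 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    q = 1.0 - p
    return p * log2(1 / p) + q * log2(1 / q)

def bsc_capacity(p):
    """Capacity 1 - h of the Binary Symmetric Channel with flip probability p."""
    return 1.0 - binary_entropy(p)

def max_message_bits(n, p):
    """Largest whole k allowed by the bound k + n*h <= n."""
    return int(n * bsc_capacity(p))

for p in (0.0, 0.01, 0.1, 0.5):
    print(p, bsc_capacity(p), max_message_bits(1000, p))
```

When p = 1/2 the noise has one full bit of entropy per transmitted bit and the bound leaves no room for any message.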

The Shannon Channel Coding Theorem says that for any 
rate r less than the channel capacity 1-h it is possible to 
choose k and n so that k/n>r and to design an encoder and 
decoder so that the (average) probability of error per mes- 
sage bit is as small as we please. The proof is very similar 
to the proof we just gave for the Shannon Coding Theorem.

3. Continuous Entropy

We now consider what entropy means for continuous 
random variables. The concepts in this case are by no means 
as self-evident as in the case of a finite-valued random 
variable. 
Relative Entropy 

The most obvious way to begin is to try to "finitize".
Let X be a continuous random variable taking values in some
finite interval. For simplicity take this interval to be
[0,a]. Now exactly as in Calculus, we partition (or subdivide)
this interval into n blocks $B_1, B_2, \ldots, B_n$. The i-th
block is the subinterval $[(i-1)a/n,\ ia/n)$. Define a new
random variable $Y_n$ that takes value $(i-1)a/n$ whenever X
takes a value in the block $B_i$. We call $Y_n$ the n-th truncation
of X. We show why we use this name by means of an example.
Suppose that a = 1 and that n = 1000. Now imagine that we perform
our experiment and that the outcome X is

    .1415926... .

The value of $Y_{1000}$ in this case would be

    .141000...

i.e. we truncate the value of X to 3 decimal places. Clearly
the truncations of X will be better and better approximations
to X as $n\to\infty$. Moreover the truncations are finite-valued
random variables.
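
The definition of the n-th truncation is easy to state as a computation. A minimal sketch, assuming the outcome x lies in [0,a) so that it falls in exactly one block; the function name `truncation` is ours.

```python
from math import floor

def truncation(x, n, a=1.0):
    """Value of Y_n when X = x: the left endpoint (i-1)a/n of the block
    B_i = [(i-1)a/n, ia/n) that contains x.  Assumes 0 <= x < a."""
    if not 0.0 <= x < a:
        raise ValueError("x must lie in [0, a)")
    i = floor(x * n / a) + 1          # index of the block containing x
    return (i - 1) * a / n

print(truncation(0.1415926, 1000))    # the example in the text: a = 1, n = 1000
```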

One might add that in practice one always uses a truncation 
in an actual experiment. It is only in our idealized mathe- 
matical models that one can speak of an arbitrary real number. 

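Now compute the entropy of $Y_n$. By definition of $Y_n$,

$$P\left(Y_n = \frac{(i-1)a}{n}\right) = P\left(\frac{(i-1)a}{n} \le X < \frac{ia}{n}\right) = F\left(\frac{ia}{n}\right) - F\left(\frac{(i-1)a}{n}\right),$$

where F(t) is the probability distribution of X. Thus

$$\begin{aligned}
H(Y_n) &= \sum_{i=1}^{n} P\left(Y_n = \frac{(i-1)a}{n}\right) \ln\frac{1}{P\left(Y_n = \frac{(i-1)a}{n}\right)} \\
       &= \sum_{i=1}^{n} \left[F\left(\frac{ia}{n}\right) - F\left(\frac{(i-1)a}{n}\right)\right] \ln\frac{1}{F\left(\frac{ia}{n}\right) - F\left(\frac{(i-1)a}{n}\right)}.
\end{aligned}$$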


The crucial step in the computation is the mean value theorem
of Calculus: if F is differentiable on the interval [s,t],
then for some x between s and t

$$F'(x)(t-s) = F(t)-F(s).$$

We apply this to each block $B_i$. Each block has length a/n, so

$$\begin{aligned}
H(Y_n) &= \sum_{i=1}^{n} F'(x_i)\frac{a}{n} \ln\frac{1}{F'(x_i)\,a/n} \\
       &= \sum_{i=1}^{n} F'(x_i)\frac{a}{n} \ln\frac{1}{F'(x_i)} + \sum_{i=1}^{n} F'(x_i)\frac{a}{n} \ln\left(\frac{n}{a}\right),
\end{aligned}$$



where each $x_i$ is some point in the block $B_i$. If we write
$f(t) = F'(t)$ for the density of X, then the first term above
is

$$\sum_{i=1}^{n} f(x_i) \ln\left(\frac{1}{f(x_i)}\right) \frac{a}{n}.$$

This is just the Riemann sum for our partition of [0,a]. So
as $n\to\infty$ this approaches

$$\int_0^a f(x) \ln\frac{1}{f(x)}\,dx.$$

Next consider the second term above. We may write this as

$$\ln\left(\frac{n}{a}\right) \sum_{i=1}^{n} f(x_i)\frac{a}{n}.$$

Except for the factor $\ln(n/a)$, we would have a Riemann sum for
$\int_0^a f(x)\,dx = 1$. However the factor $\ln(n/a)$ means that as $n\to\infty$,

$$H(Y_n) \approx \int_0^a f(x) \ln\frac{1}{f(x)}\,dx + \ln(n) - \ln(a).$$

So as $n\to\infty$, $H(Y_n)\to\infty$.

The difficulty is easily seen. As we partition [0,a]
into finer and finer blocks, the random variable $Y_n$ is
taking an enormous number of values, some fraction of which
are roughly equally likely. This is an artifact of our subdivision
process and ought to be eliminated. We do this by
measuring not the absolute entropy of $Y_n$ but rather the difference
between the entropy of $Y_n$ and the maximum possible
entropy of a random variable taking n values. We call this
the relative entropy of $Y_n$:

    Relative entropy of $Y_n$ = $H(Y_n) - \ln(n)$.

In other words, instead of measuring how far $Y_n$ is from
being completely certain, we measure how close $Y_n$ is to
being completely random. For finite-valued random variables
these two ways of measuring entropy are equivalent, but when
we take the limit as $n\to\infty$, only the relative entropy converges.
We therefore define:



$$\text{Relative entropy of } X = \lim_{n\to\infty} (\text{relative entropy of } Y_n) = \int_0^a f(x) \ln\frac{1}{f(x)}\,dx - \ln(a).$$

For this notion of entropy, the case of total randomness
will be represented by a relative entropy of zero.
Less uncertain random variables will have a negative relative
entropy. Continuous random variables can have arbitrarily
large negative entropy: complete certainty is impossible for
continuous random variables.
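
The limiting process can be watched numerically. The sketch below computes $H(Y_n) - \ln(n)$ directly from the distribution function F; the example distribution $F(x) = x^2$ on [0,1] (density 2x) is an assumption chosen only for illustration, and for it the integral $\int_0^1 2x\ln\frac{1}{2x}\,dx - \ln(1)$ works out to $\frac12 - \ln 2$.

```python
from math import log

def truncation_entropy(F, a, n):
    """H(Y_n) in nats, where P(Y_n = (i-1)a/n) = F(ia/n) - F((i-1)a/n)."""
    total = 0.0
    for i in range(1, n + 1):
        p_i = F(i * a / n) - F((i - 1) * a / n)
        if p_i > 0.0:                      # blocks of probability 0 contribute 0
            total += p_i * log(1.0 / p_i)
    return total

def relative_entropy_of_truncation(F, a, n):
    """Relative entropy of Y_n, i.e. H(Y_n) - ln(n)."""
    return truncation_entropy(F, a, n) - log(n)

def F_example(x):
    """Illustrative distribution function on [0,1] with density f(x) = 2x."""
    return x * x

for n in (10, 100, 1000, 10000):
    print(n, relative_entropy_of_truncation(F_example, 1.0, n))
print("limit:", 0.5 - log(2.0))
```

The printed values are negative, as expected for a random variable that is less uncertain than the uniform one.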

Which continuous random variables will have entropy 
zero? In other words, what is the continuous analogue of 
the equally likely probability distribution? To answer this 
we proceed as we did for finite-valued random variables. Let 
X be any continuous R.V. taking values in [0,a]. Then

$$\begin{aligned}
\text{relative entropy of } X &= \int_0^a f(x)\ln\frac{1}{f(x)}\,dx - \ln(a) \\
&= \int_0^a f(x)\ln\frac{1}{f(x)}\,dx - \ln(a)\int_0^a f(x)\,dx \\
&= \int_0^a f(x)\left[\ln\frac{1}{f(x)} - \ln(a)\right]dx \\
&= \int_0^a f(x)\ln\frac{1}{f(x)\,a}\,dx \\
&\le \int_0^a f(x)\left[\frac{1}{f(x)\,a} - 1\right]dx \\
&= \int_0^a \frac{dx}{a} - \int_0^a f(x)\,dx \\
&= 1 - 1 = 0.
\end{aligned}$$

Now the logarithmic inequality tells us that the above inequality
is an equality if and only if $\frac{1}{f(x)\,a} = 1$ or $f(x) = \frac{1}{a}$.
In other words, the maximum entropy occurs precisely when X
has the uniform distribution on [0,a].

Boltzmann Entropy 

The notion of relative entropy is fine for random 
variables taking values in a finite interval, but most con- 
tinuous random variables we have seen do not have this property. 
The most natural way to try to extend entropy to arbitrary 
continuous random variables is to use a limiting process similar 
to what we used for extending entropy from finite-valued random 
variables to finite-interval random variables. We will do 
this first for positive random variables before going on to 
the general case. 

Let T be a positive continuous random variable (i.e.
$P(T\le 0) = 0$). For a > 0, we define the restriction of T to [0,a]
to be the random variable $T_a = T\,|\,(T\le a)$. By this we mean that
$T_a$ takes the value of T conditioned on the occurrence of $(T\le a)$.
We already saw this in the definition of almost sufficient
statistics. The probability distribution of $T_a$ is given by
$P(T_a\le t) = P(T\le t\,|\,T\le a)$, and the density of $T_a$ is then given by

$$\operatorname{dens}(T_a = t) = \begin{cases} f(t)\Big/\displaystyle\int_0^a f(u)\,du, & \text{if } 0\le t\le a \\ 0, & \text{if } t>a \text{ or } t<0, \end{cases}$$

where $f(t) = \operatorname{dens}(T=t)$. Write $C_a = 1\Big/\int_0^a f(u)\,du$ for the above
normalization constant. Clearly $T_a$ will be a better and
better approximation to T as $a\to\infty$. As with truncations,
restrictions are always used in an actual experiment.

It would be nice if we could define the relative entropy
of T to be the limit of the relative entropy of $T_a$ as
$a\to\infty$, but unfortunately this diverges:

$$\text{relative entropy of } T_a = \int_0^a C_a f(t)\ln\frac{1}{C_a f(t)}\,dt - \ln(a) \to -\infty \quad\text{as } a\to\infty.$$



As before the difficulty is that we are not measuring entropy
properly. The case of total randomness, entropy zero,
is the uniform distribution on [0,a]; but as $a\to\infty$ this distribution
ceases to make sense. So we are attempting to measure
the entropy of T relative to that of a nonexistent distribution!
What should we do? We no longer have either total certainty
or total uncertainty from which to measure entropy.

What we do is to "renormalize" our measurement of entropy
so that the entropy of the uniform distribution on
[0,a] is ln(a) rather than 0. We do this by analogy with
the "equally likely" distribution on n points whose entropy
is ln(n). There is no really convincing justification for
this choice of normalization. The entropy defined in this
way is called the Boltzmann or differential entropy:



$$\begin{aligned}
H(T) &= \lim_{a\to\infty}\left[(\text{relative entropy of } T_a) + \ln(a)\right] \\
     &= \lim_{a\to\infty}\int_0^a C_a f(t)\ln\frac{1}{C_a f(t)}\,dt \\
     &= \int_0^\infty f(t)\ln\frac{1}{f(t)}\,dt,
\end{aligned}$$

if this improper integral exists. The same definition works,
in fact, for any continuous random variable.

We now ask which positive continuous random variables
take maximum Boltzmann entropy. Let T be such a R.V., and let
$\mu = E(T)$ be its expectation. To bound the entropy of T we
use a method known as the Lagrange multiplier method. This
method is appropriate wherever we wish to maximize some
quantity subject to constraints. In this case the constraints
that the density function f(t) of T must satisfy are:

$$\int_0^\infty f(t)\,dt = 1 \quad\text{and}\quad \int_0^\infty t f(t)\,dt = \mu.$$







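Multiply the constraints by constants $\alpha$ and $\beta$ to be determined
later and subtract both from the entropy of T. Then proceed
as in all our previous maximum entropy calculations:

$$\begin{aligned}
H(T) - \alpha - \beta\mu &= \int_0^\infty f(t)\ln\frac{1}{f(t)}\,dt - \alpha\int_0^\infty f(t)\,dt - \beta\int_0^\infty t f(t)\,dt \\
&= \int_0^\infty f(t)\left[\ln\frac{1}{f(t)} - \alpha - \beta t\right]dt \\
&= \int_0^\infty f(t)\ln\frac{1}{f(t)e^{\alpha+\beta t}}\,dt \\
&\le \int_0^\infty f(t)\left[\frac{1}{f(t)e^{\alpha+\beta t}} - 1\right]dt \\
&= \int_0^\infty e^{-\alpha-\beta t}\,dt - \int_0^\infty f(t)\,dt \\
&= \left[\frac{e^{-\alpha-\beta t}}{-\beta}\right]_0^\infty - 1 \\
&= \frac{e^{-\alpha}}{\beta} - 1 \quad (\text{if } \beta>0).
\end{aligned}$$

By the basic logarithmic inequality, the above inequality
is an equality if and only if $\frac{1}{f(t)e^{\alpha+\beta t}} = 1$ or $f(t) = e^{-\alpha-\beta t}$.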
We now use the constraints to solve for $\alpha$ and $\beta$:

$$1 = \int_0^\infty f(t)\,dt = \int_0^\infty e^{-\alpha-\beta t}\,dt = \frac{e^{-\alpha}}{\beta}$$

$$\mu = \int_0^\infty t f(t)\,dt = \int_0^\infty t e^{-\alpha-\beta t}\,dt = \frac{e^{-\alpha}}{\beta^2}.$$

Therefore $\beta = e^{-\alpha} = \beta^2\mu$ or $\beta = 1/\mu = e^{-\alpha}$. The function f(t)
thus has the form

$$f(t) = \frac{1}{\mu}\,e^{-t/\mu},$$

i.e. T is exponentially distributed with parameter $1/\mu$.
Moreover, the entropy of T is $H(T) = \alpha+\beta\mu = 1+\ln(\mu)$. Therefore
we see that as $\mu$ gets large T can have arbitrarily high
entropy. Thus there is no positive random variable having
maximum entropy among all such random variables.
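
The value $1+\ln(\mu)$ can be checked against a direct numerical integration of $\int f(t)\ln\frac{1}{f(t)}\,dt$. The sketch below uses a plain midpoint rule; the helper name `boltzmann_entropy`, the choice $\mu = 2$, the cutoff of the infinite range at $60\mu$, and the comparison density (uniform on $[0, 2\mu]$, which also has mean $\mu$) are all assumptions made for illustration.

```python
from math import exp, log

def boltzmann_entropy(f, lo, hi, steps=100000):
    """Midpoint-rule estimate of the integral of f*ln(1/f) over [lo, hi], in nats."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * width
        ft = f(t)
        if ft > 0.0:
            total += ft * log(1.0 / ft) * width
    return total

mu = 2.0                                   # illustrative mean

def exponential_density(t):
    return exp(-t / mu) / mu               # exponential with parameter 1/mu

def uniform_density(t):
    return 1.0 / (2.0 * mu)                # uniform on [0, 2*mu], mean mu

h_exp = boltzmann_entropy(exponential_density, 0.0, 60.0 * mu)
h_uni = boltzmann_entropy(uniform_density, 0.0, 2.0 * mu)
print(h_exp, 1.0 + log(mu))                # the two should agree closely
print(h_uni, log(2.0 * mu))                # smaller than the exponential value
```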
Standard Entropy 

The reason we had to specify the expectation of a positive
random variable in order to find the one having maximum entropy
arises from an important distinction between finite entropy
and Boltzmann entropy: the choice of units in which we
measure our random variable alters the Boltzmann entropy but
has no effect on the finite entropy. Indeed, the entropy of
a finite-valued random variable depends only on the partition
it defines. For example, if X is uniformly distributed on
[0,1], then Y = 2X is uniformly distributed on [0,2]. Although
Y represents the same phenomenon as X, the difference
being the units with which we measure distance, obviously an
observation of X is more certain than an observation of Y (one
bit more certain to be precise). More generally, for any
continuous random variable X and any constant C > 0, $H(CX) = H(X) + \ln(C)$.
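
The scaling rule follows from the definition of Boltzmann entropy by a change of variable. As a sketch, assume C > 0 and that X has density f, so that CX has density $\frac{1}{C} f(y/C)$:

```latex
\begin{aligned}
H(CX) &= \int \frac{1}{C}\, f\!\left(\frac{y}{C}\right) \ln\frac{C}{f(y/C)}\,dy \\
      &= \int f(x)\left[\ln(C) + \ln\frac{1}{f(x)}\right]dx \qquad (y = Cx) \\
      &= \ln(C)\int f(x)\,dx + \int f(x)\ln\frac{1}{f(x)}\,dx \\
      &= \ln(C) + H(X).
\end{aligned}
```

With C = 2 the extra term is ln(2) nats, which is exactly one bit, in agreement with the example of X and Y = 2X.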

In order to speak of the entropy of the phenomenon repre- 
sented by a random variable, independent of scale changes, 
we introduce yet one more notion of entropy. The standard 
entropy of a random variable is the Boltzmann entropy of its 
standardization. Using the notion of standard entropy we can 
ask an important question. Which continuous random variables 
have the maximum standard entropy? The answer is that, up to 
changes of scale, there is exactly one such random variable 
and it is a random variable we have not yet seen before: the 
normal distribution. This random variable forms the basis 
of the Wiener process, the last of the four principal stochastic 
processes of probability theory. 

We now compute this random variable. Since we want the
random variable to have maximum standard entropy, we may assume
it is standard. Let X be such a random variable, and
let f(x) be its density. We maximize H(X) by the method of
Lagrange multipliers we used above, but now there are three
constraints:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1, \qquad \int_{-\infty}^{\infty} x f(x)\,dx = 0, \qquad \int_{-\infty}^{\infty} x^2 f(x)\,dx = 1.$$



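One of the constraints being zero we need just two parameters
$\alpha$ and $\beta$ to be determined later:

$$\begin{aligned}
H(X) - \alpha - \beta &= \int_{-\infty}^{\infty} f(x)\ln\frac{1}{f(x)}\,dx - \alpha\int_{-\infty}^{\infty} f(x)\,dx - \beta\int_{-\infty}^{\infty} x^2 f(x)\,dx \\
&= \int_{-\infty}^{\infty} f(x)\ln\frac{1}{f(x)e^{\alpha+\beta x^2}}\,dx \\
&\le \int_{-\infty}^{\infty} f(x)\left[\frac{1}{f(x)e^{\alpha+\beta x^2}} - 1\right]dx \\
&= \int_{-\infty}^{\infty} e^{-\alpha-\beta x^2}\,dx - \int_{-\infty}^{\infty} f(x)\,dx \\
&= \frac{e^{-\alpha}}{\sqrt{\beta}}\int_{-\infty}^{\infty} e^{-u^2}\,du - 1 \qquad (u = \sqrt{\beta}\,x) \\
&= \frac{e^{-\alpha}\sqrt{\pi}}{\sqrt{\beta}} - 1.
\end{aligned}$$

The fact that $\int_{-\infty}^{\infty} e^{-u^2}\,du = \sqrt{\pi}$ is a standard fact from Calculus.
The proof proceeds as follows. Let $A = \int_{-\infty}^{\infty} e^{-u^2}\,du$.
Then since u is just a dummy variable, $A = \int_{-\infty}^{\infty} e^{-v^2}\,dv$ as well.
Hence

$$A^2 = \left(\int_{-\infty}^{\infty} e^{-u^2}\,du\right)\left(\int_{-\infty}^{\infty} e^{-v^2}\,dv\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-u^2-v^2}\,du\,dv.$$

Now switch to polar coordinates. Then $r^2 = u^2 + v^2$, $du\,dv = r\,dr\,d\theta$
and the limits of integration are $0\le\theta\le 2\pi$ and $0\le r<\infty$:

$$A^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta = \int_0^{2\pi}\left[-\frac{1}{2}e^{-r^2}\right]_0^{\infty} d\theta = \int_0^{2\pi}\frac{1}{2}\,d\theta = \pi.$$

Hence $A = \sqrt{\pi}$.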

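Returning to our bound on the entropy H(X), we know
by the basic logarithmic inequality that this will be an
equality if and only if $f(x) = e^{-\alpha-\beta x^2}$. We now use the
constraints to solve for $\alpha$ and $\beta$.

$$1 = \int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} e^{-\alpha-\beta x^2}\,dx = e^{-\alpha}\sqrt{\pi/\beta}$$

$$\begin{aligned}
1 = \int_{-\infty}^{\infty} x^2 f(x)\,dx &= \int_{-\infty}^{\infty} x^2 e^{-\alpha-\beta x^2}\,dx \\
&= \left[-\frac{x e^{-\alpha-\beta x^2}}{2\beta}\right]_{-\infty}^{\infty} + \frac{1}{2\beta}\int_{-\infty}^{\infty} e^{-\alpha-\beta x^2}\,dx \\
&= 0 + \frac{1}{2\beta}\,e^{-\alpha}\sqrt{\pi/\beta}.
\end{aligned}$$

Therefore $e^{-\alpha} = \sqrt{\beta/\pi} = 2\beta\sqrt{\beta/\pi}$ from which we conclude that
$\beta = 1/2$ and $e^{-\alpha} = 1/\sqrt{2\pi}$. Hence the maximum entropy among
all standard continuous random variables is achieved precisely
when

$$\operatorname{dens}(X=x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$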

We say that X has the standard normal distribution in this
case. More generally by changing the origin (zero point) and
unit of measurement (scale) we get a collection of random
variables, each determined by its mean and variance.

Definition. A continuous random variable X is said to have
the normal or Gaussian distribution with mean m and variance
$\sigma^2$ if

$$\operatorname{dens}(X=x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-m)^2/2\sigma^2}.$$

For brevity we will write that X is $N(m,\sigma^2)$. Some authors
write $N(m,\sigma)$ instead of $N(m,\sigma^2)$; one should beware. We leave
it as an exercise to verify that the above density really
does define a probability distribution with mean m and variance
$\sigma^2$ and that all of them have the standard normal distribution
as their common standardization.
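
Since the standard entropy is the Boltzmann entropy of the standardization, it equals $H(X) - \ln(\sigma)$ and should not depend on the scale. The sketch below checks this for three families, using the Boltzmann entropies $\ln(\sigma\sqrt{2\pi e})$ for $N(m,\sigma^2)$, $\ln(a)$ for the uniform distribution on [0,a], and $1-\ln(\alpha)$ for the exponential distribution of intensity $\alpha$. The function names are ours, and the standard deviations $a/\sqrt{12}$ and $1/\alpha$ are the usual values for these two distributions.

```python
from math import e, log, pi, sqrt

def standard_entropy_normal(sigma):
    boltzmann = log(sigma * sqrt(2.0 * pi * e))   # H(X) for N(m, sigma^2)
    return boltzmann - log(sigma)

def standard_entropy_uniform(a):
    boltzmann = log(a)                            # H(X) for uniform on [0, a]
    sigma = a / sqrt(12.0)                        # its standard deviation
    return boltzmann - log(sigma)

def standard_entropy_exponential(alpha):
    boltzmann = 1.0 - log(alpha)                  # H(X) for intensity alpha
    sigma = 1.0 / alpha                           # its standard deviation
    return boltzmann - log(sigma)

for scale in (0.5, 1.0, 7.0):
    print(scale,
          standard_entropy_normal(scale),
          standard_entropy_uniform(scale),
          standard_entropy_exponential(scale))
```

Each column stays constant as the scale changes, and the normal value is the largest of the three.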

There are more distributions determined by maximum en- 
tropy, especially those in statistical thermodynamics and 
quantum mechanics , but I trust that you now see the basic 
ideas . 

Summary 

The four principal processes of probability theory are all
determined by maximum entropy properties. We summarize this
here.

Type of Entropy            | Class of Random Variables       | Definition
---------------------------|---------------------------------|-----------------------------------------------
(Finite partition) entropy | Finite-valued                   | $H(\pi) = \sum_i P(B_i)\ln\frac{1}{P(B_i)}$, $H(X) = H(\pi(X))$
Relative entropy           | Continuous with values in [0,a] | $\int_0^a f(x)\ln\frac{1}{f(x)}\,dx - \ln(a)$
Boltzmann entropy          | Continuous                      | $H(X) = \int_{-\infty}^{\infty} f(x)\ln\frac{1}{f(x)}\,dx$
Standard entropy           | Continuous                      | $H\left(\frac{X-m}{\sigma}\right) = H(X) - \ln(\sigma)$

Types of Entropy



Process  | Distribution                     | Class of Random Variables for which entropy is maximized                                      | Model
---------|----------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------------
Sampling | Finite uniform on n points       | Random variables taking at most n values                                                      | Placements of 1 ball into n boxes
Uniform  | Uniform on [0,a]                 | Continuous random variables taking values in [0,a] (Relative or Boltzmann entropy)            | Sampling one point completely at random from [0,a]
Poisson  | Exponential, intensity $\alpha$  | Positive continuous random variables of mean $1/\alpha$ (Boltzmann entropy)                   | Continuous memoryless waiting time, intensity $\alpha$
Wiener   | Normal, $N(m,\sigma^2)$          | Continuous random variables having mean m and variance $\sigma^2$ (Boltzmann entropy)         | Position of a continuous random walk of rate $\sigma$ starting at m

Maximum Entropy Distributions



Distribution                    | Entropy            | Relative entropy | Boltzmann entropy             | Standard entropy
--------------------------------|--------------------|------------------|-------------------------------|--------------------------------------------
Finite uniform on n points      | $\log_2(n)$ bits   |                  |                               |
Uniform on [0,a]                |                    | 0                | $\ln(a)$                      | $\frac{1}{2}\ln(12) \approx 1.2425$ nats
Exponential, intensity $\alpha$ |                    |                  | $1-\ln(\alpha)$               | 1
Normal, $N(m,\sigma^2)$         |                    |                  | $\ln(\sigma\sqrt{2\pi e})$    | $\ln(\sqrt{2\pi e}) \approx 1.4189$ nats

Values of entropy

4. Exercises for Chapter VII Entropy and Information



1. A visitor to an imaginary country finds that the inhabitants
of city A always tell the truth, while the inhabitants of
city B always lie. The visitor wants to know which city
he is in. He may ask only yes-no direct questions. (A question
such as "If I were to ask you what city I am in, what
would you say?" is indirect and would only confuse an
inhabitant.) How many questions must the visitor ask? Note
that an inhabitant of city A could be temporarily residing
in city B and vice versa.

2. You are given twelve coins, one of which is counterfeit, 
and a balance. The counterfeit coin is either light or 
heavy, you do not know which. How many weighings are 
necessary to determine which coin is counterfeit? 

3. You are given five coins, some of which may be counterfeit. 
A counterfeit coin is lighter than a good coin, and all 
counterfeit coins weigh the same. Again you are given a 
balance. How many weighings are necessary to find all 

of the counterfeit coins? 

4. A deck of 52 cards is said to have been randomly shuffled
if all 52! permutations are equally likely. What we
normally regard as being a random shuffle is in fact very
far from random. For example, the cut-and-interlace
shuffle (also called the perfect shuffle) has the following
property. If a new deck (in the standard Bridge order:
2 of clubs, 3 of clubs, ..., ace of spades) is perfectly
shuffled, "cut" at a point 4m cards from the top, and dealt
as in a Bridge game, then each of the four players will
receive all the cards of one suit. It is known that
frequent Bridge players are capable of consistently
achieving a perfect shuffle.

How much information is contained in a random shuffle?
in a perfect shuffle? in a random cut? How many
independent random cuts are needed to achieve a completely
random shuffle?

5. Let N be an integer between 1 and 2000. Divide N by
6, 10, 22 and 35, and find the remainders. How much
information about N do these four remainders tell you?

6. Show that the entropy of the normal distribution $N(m,\sigma^2)$
is $\log_2(\sigma\sqrt{2\pi e})$ bits. Use this to answer the following
question. A coin is tossed 1000 times, getting 368 heads
and 632 tails. How much additional information will one
more toss of the coin give one?

7. Suppose that the imaginary country of exercise 1
has another city C where the inhabitants alternately tell the
truth and lie. What is the smallest number of questions the
visitor must ask to find out which city he is in?



Generalize problems 2 and 3 above to an arbitrary number 
of coins. 

Let f(x) be a differentiable function defined on [0,a].
Assume that f(0) = 0 and that $|f'(x)| \le b$ for all x in
[0,a]. Find an upper bound on the amount of information
necessary to determine the value of f(x) at every
$x \in [0,a]$ with an error not exceeding $\epsilon > 0$.

You are playing a variation of "20 questions." A chooses 
a number between 1 and 1,000,000, and B must find this 
number by asking yes-no questions about it, except that 
B asks random questions. How long does it take for B to 
find the number? Let T be the time needed for B to find 
the number. 

Let $k_1, k_2, \ldots$ be a sequence of numbers such that
$\lim_{n\to\infty} (k_n - \log_2(n)) = c$. Let $T_n$ be as in exercise 9 above
but for the problem of guessing a number between 1 and n.
Prove that $\lim_{n\to\infty} P(T_n \le k_n) = e^{-1/2^c}$.

VIII Markov Chains

All the processes we have considered so far have been 
based on sequences of independent equidistributed random 
variables. We now consider processes which are based on 
sequences of dependent random variables but for which the 
dependence is of the simplest possible kind: the future 
depends on the present but not on the past. 

1. The Markov Property 

Let $X_0, X_1, X_2, \ldots$ be a sequence of integer random
variables. We think of the values of the $X_n$'s as being
the states of the Markov chain. Thus if $(X_n = i)$, we say
the process is in state i at time n. Moreover, if
$(X_n = i)$ and $(X_{n+1} = j)$, then we say there was a transition
from state i to state j at time n.

Definition. A sequence $X_0, X_1, \ldots$ of integer random variables
forms a Markov chain if for any integers $i_0, i_1, \ldots, i_n$,

$$P\big(X_n = i_n \,\big|\, (X_0 = i_0)\cap(X_1 = i_1)\cap\cdots\cap(X_{n-1} = i_{n-1})\big) = P(X_n = i_n \,|\, X_{n-1} = i_{n-1}).$$

In other words, the future states of the Markov chain are
dependent only on the present state and not on how the
Markov chain reached the present state. We call this condition
the Markov property.

The conditional probability

$$p_{ijn} = P(X_{n+1} = j \,|\, X_n = i)$$

is called the transition probability from state i to state
j at time n. By the law of alternatives, the probability
distribution of $X_{n+1}$ is determined by the transition probabilities
and the probability distribution of $X_n$:

$$P(X_{n+1} = j) = \sum_i P(X_{n+1} = j \,|\, X_n = i)\,P(X_n = i) = \sum_i p_{ijn}\,P(X_n = i).$$

As a result we see that all the probability distributions of
the $X_n$'s as well as all their joint distributions are
determined by the distribution of $X_0$ and the transition
probabilities.

We have seen several examples of Markov chains already.
The Bernoulli process is a Markov chain having two states:
heads and tails, or 1 and 0. In this case the transition
probabilities are given by

$$p_{00n} = q, \qquad p_{01n} = p,$$
$$p_{10n} = q, \qquad p_{11n} = p$$

(the probability distribution of $X_0$ can be anything).
Another example is the sequential sampling process. Here
the state is the number of red balls in the urn. For if we
know the number of red balls in the urn as well as the
number of balls chosen so far (i.e. the time) we can compute
how many black balls are in the urn.

The sequential sampling process has the property that 
the transition probabilities depend not only on the states 
i and j but also on n, the number of balls chosen so 
far. In such a case our process is continually changing or 
inhomogeneous . In this chapter we will only study Markov 
chains such that the transition probabilities are independent 
of time. 

Definition. A Markov chain $X_0, X_1, \ldots$ is said to be
homogeneous if the transition probabilities

$$p_{ij} = P(X_{n+1} = j \,|\, X_n = i)$$

do not depend on n.

Many apparently inhomogeneous Markov chains can be reinterpreted
as homogeneous Markov chains, so that this concept is
not as special as it may at first appear. For example, if
we define a "state" of the sequential sampling process to be
the pair of numbers: (no. of red balls, no. of black balls),
then the sequential sampling process is a homogeneous Markov
chain.

When we write the transition probabilities $p_{ij}$ as a
matrix we get a matrix M called the transition probability
matrix of the Markov chain.

$$M = \begin{pmatrix}
\ddots & \vdots & \vdots &        \\
\cdots & p_{11} & p_{12} & \cdots \\
\cdots & p_{21} & p_{22} & \cdots \\
       & \vdots & \vdots & \ddots
\end{pmatrix}
\qquad (\text{row } i = \text{input state},\ \text{column } j = \text{output state})$$

The rows represent the starting states and the columns represent
the ending states, during each unit of time. The
transition probability matrix determines the Markov chain
except for the probability distribution of $X_0$. The entries
of the matrix must be between 0 and 1, and the sum of the
entries of each row is 1. On the other hand, we can say
nothing about the columns.

Definition. A row vector (with possibly infinitely many
coefficients) is said to be a stochastic vector if all entries
are between 0 and 1 and the sum of all coefficients is 1.
A square matrix (with possibly infinitely many rows) is
called a stochastic matrix if its rows are all stochastic
vectors.

The term "stochastic vector" is simply another way of looking
at the probability distribution of an integer random
variable. We see that distributions and Markov chains give
rise to a new way of looking at vectors and matrices. A
pair consisting of a stochastic matrix M and a stochastic
vector u determines a unique Markov chain such that u is
the row vector corresponding to the distribution of $X_0$ and
M is the transition probability matrix.

We call the distribution of $X_0$ the initial distribution
of the Markov chain. As we have already remarked, the
distributions of $X_1, X_2, X_3, \ldots$ are determined successively
by the formula

$$P(X_{n+1} = j) = \sum_i p_{ij}\,P(X_n = i).$$

In terms of matrices, this says that if $u_n$ is the stochastic
vector corresponding to the distribution of $X_n$, then

$$u_{n+1} = u_n \cdot M$$

is the stochastic vector corresponding to $X_{n+1}$, where
$u_n \cdot M$ is the product of the matrices $u_n$ and M. In other
words, the transition from time n to time n+1 in a
Markov chain corresponds to matrix multiplication. More
generally, we can iterate the above formula to get

$$u_n = u_0 M^n$$

showing explicitly how the distributions of the $X_n$'s
depend on $u_0$ and M.
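
The iteration $u_{n+1} = u_n\cdot M$ is short to write out. The following sketch works with a finite state space and plain Python lists; the function names and the 2-state matrix used as an example are assumptions made purely for illustration.

```python
def step(u, M):
    """One unit of time: the row vector u*M."""
    states = len(M)
    return [sum(u[i] * M[i][j] for i in range(states)) for j in range(states)]

def distribution_at(u0, M, n):
    """The distribution u_n = u_0 * M^n, computed by n successive steps."""
    u = list(u0)
    for _ in range(n):
        u = step(u, M)
    return u

def is_stochastic_vector(u, tol=1e-12):
    """All entries between 0 and 1 and total equal to 1 (up to rounding)."""
    return all(-tol <= x <= 1.0 + tol for x in u) and abs(sum(u) - 1.0) <= tol

# Hypothetical 2-state chain: rows are starting states, columns ending states.
M_example = [[0.9, 0.1],
             [0.5, 0.5]]
u0_example = [1.0, 0.0]            # start in the first state with probability 1
for n in range(4):
    print(n, distribution_at(u0_example, M_example, n))
```

Each printed vector is again a stochastic vector, as it must be when M is a stochastic matrix.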

The Bernoulli Process 

The Bernoulli process, as the process of tossing a
coin, is a Markov chain whose transition matrix is

$$M = \begin{pmatrix} q & p \\ q & p \end{pmatrix}.$$

Notice that $M^n = M$ for all n and that $u M^n = [q,p]$ no
matter what the initial distribution u is.
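
Both remarks can be verified by a direct multiplication, using only p + q = 1 and, for a stochastic vector $u = [u_0, u_1]$ (our notation for its two entries), $u_0 + u_1 = 1$:

```latex
M^2 = \begin{pmatrix} q & p \\ q & p \end{pmatrix}
      \begin{pmatrix} q & p \\ q & p \end{pmatrix}
    = \begin{pmatrix} q(q+p) & p(q+p) \\ q(q+p) & p(q+p) \end{pmatrix}
    = \begin{pmatrix} q & p \\ q & p \end{pmatrix} = M,
\qquad
uM = \begin{pmatrix} u_0 & u_1 \end{pmatrix}
     \begin{pmatrix} q & p \\ q & p \end{pmatrix}
   = \begin{pmatrix} (u_0+u_1)q & (u_0+u_1)p \end{pmatrix}
   = \begin{pmatrix} q & p \end{pmatrix}.
```

Since $M^2 = M$, induction gives $M^n = M$ for all $n \ge 1$.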



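On the other hand if we use the random walk interpretation
of the Bernoulli process, we get a very different
Markov chain. In this case the states are the integers,
both positive and negative. The state represents the position
of the random walk at the given time.

    [Figure: the integer states ..., -3, -2, ... on a number line]

The transition probabilities are:

$$p_{ij} = \begin{cases} p & \text{if } j = i+1 \text{ (move right)} \\ q & \text{if } j = i-1 \text{ (move left)} \\ 0 & \text{in all other cases.} \end{cases}$$

The matrix of this Markov chain is an infinite matrix, part
of which looks like this:

$$M = \begin{pmatrix}
\ddots &   &   &   &   &   &   &        \\
\cdots & 0 & p & 0 & 0 & 0 & 0 & \cdots \\
\cdots & q & 0 & p & 0 & 0 & 0 & \cdots \\
\cdots & 0 & q & 0 & p & 0 & 0 & \cdots \\
\cdots & 0 & 0 & q & 0 & p & 0 & \cdots \\
\cdots & 0 & 0 & 0 & q & 0 & p & \cdots \\
       &   &   &   &   &   &   & \ddots
\end{pmatrix}$$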

Unlike the coin-tossing manifestation of the Bernoulli
process, the powers $M^n$ of this transition matrix are
progressively more complicated. Moreover, the behavior of
this Markov chain does depend on the initial distribution
of $X_0$. Typically $X_0$ will take some value i with probability
1, in which case we say that i is the starting point
of the random walk. If we start at i = 0, the successive
distributions of $X_0, X_1, X_2, \ldots$ of this Markov chain are:

    $X_0$:  $u_0 = [\,\cdots\ 0,\ 0,\ 1,\ 0,\ 0,\ \cdots\,]$
    $X_1$:  $u_1 = [\,\cdots\ 0,\ q,\ 0,\ p,\ 0,\ \cdots\,]$
    $X_2$:  $u_2 = [\,\cdots\ q^2,\ 0,\ 2pq,\ 0,\ p^2,\ \cdots\,]$
    ...

When p = q = 1/2 we say the random walk is symmetric.
In this case the transition matrix M is symmetric.
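
Because the transition matrix of the random walk is infinite, it is convenient to store only the states that carry positive probability. The sketch below does this with a dictionary from states to probabilities; the function names and the sample bias p = 0.3 are illustrative assumptions.

```python
def random_walk_step(dist, p):
    """Apply one transition of the random walk to a distribution.

    `dist` maps each state (an integer) to its probability; states that
    are absent have probability 0.  From state i the walk moves to i+1
    with probability p and to i-1 with probability q = 1 - p.
    """
    q = 1.0 - p
    new = {}
    for state, prob in dist.items():
        new[state + 1] = new.get(state + 1, 0.0) + p * prob
        new[state - 1] = new.get(state - 1, 0.0) + q * prob
    return new

def random_walk_distribution(start, p, n):
    """Distribution of X_n for a walk that starts at `start` with probability 1."""
    dist = {start: 1.0}
    for _ in range(n):
        dist = random_walk_step(dist, p)
    return dist

for n in range(3):
    print(n, sorted(random_walk_distribution(0, 0.3, n).items()))
```

For n = 2 the three nonzero entries are $q^2$, $2pq$ and $p^2$ at the states -2, 0 and 2, matching the vector $u_2$ displayed above.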

As we have already remarked, the random walk model 
and the coin-tossing model are both interpretations of a 
single process: the Bernoulli process. However, the two 
models correspond to very different Markov chains, and 
hence one asks completely different questions about the two 
models. For example, we will consider the question of how 
long it takes for the random walk to return to its starting 
point. One might also consider how many times the random 
walk crosses the origin. These questions will be considered 
not only for random walks but also for more general Markov 
chains. A great number of physical and chemical phenomena 
can be modelled using Markov chains and random walks in 
particular. For example, polymer growth can be modelled 
using two- and three-dimensional random walks. A two- 
dimensional random walk is just a pair of independent 
one-dimensional random walks proceeding simultaneously. 



2. The Ruin Problem 

Suppose that we are gambling in a casino. Suppose 
that we bet $1 on each play and that we win another dollar 
with probability p and lose the dollar with probability 
q. This situation is modelled by a random walk. The
starting point \(X_0\) is our initial fortune, and the state
at the time n is our fortune at that time. Unfortunately,
the random walk model we have just considered does not take
into consideration the fact that we cannot continue playing
if we run out of money. Furthermore, there is a number c
(possibly very large) such that if we ever succeed in
reaching this state the gambling house must stop allowing
us to play (or we may simply choose to stop playing if our
fortune ever reaches c).

The Markov chain corresponding to this situation is 
called a random walk with absorbing barriers . The barriers 
are the states 0 and c, and these have the property that
once one of them occurs, the subsequent states of the Markov 
chain are all this same state. 

[Figure: the states 0, ..., c on a line; from the initial fortune j
the walk moves left with probability q and right with probability p.]

This Markov chain has only finitely many states, so the
transition probability matrix M is an ordinary square
matrix. All rows of M except the first and last have the
same form as the rows of the barrierless random walk. The
top and bottom rows have a 1 as the first and last entries
respectively, indicating that if either of these states is a
starting state, then the ending state is the same state.
With rows and columns indexed by the states 0, 1, 2, ..., c-1, c:

\[
M = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & q & 0 & p \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1
\end{pmatrix}
\]
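
As a small illustration, the matrix of the random walk with absorbing barriers can be built for any c. This is only a sketch: the name `absorbing_walk_matrix` and the values c = 4, p = 1/3 are assumptions made for the example.

```python
from fractions import Fraction

def absorbing_walk_matrix(c, p):
    """Transition matrix on the states 0..c with absorbing barriers at 0 and c."""
    q = 1 - p
    M = [[Fraction(0)] * (c + 1) for _ in range(c + 1)]
    M[0][0] = Fraction(1)      # state 0 is absorbing
    M[c][c] = Fraction(1)      # state c is absorbing
    for i in range(1, c):
        M[i][i - 1] = q        # lose a dollar
        M[i][i + 1] = p        # win a dollar
    return M

M = absorbing_walk_matrix(4, Fraction(1, 3))   # illustrative values
```

Every row sums to 1, as it must for a stochastic matrix, and the first and last rows keep the walk where it is.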

Other kinds of barriers are possible. Suppose that if 
our fortune decreases to zero at any time, we are given a 
$1 advance (or loan) from an outside source ("Daddy") so 
that we can continue to play. We call this a reflecting 
barrier . Still another possibility is the elastic barrier 
for which we are either reflected or remain in the same 
state depending on some probability. In other words "Daddy" 
will give us a loan, but we may have to wait for it. The 
transition matrix for a random walk having a reflecting 
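barrier at c and an elastic barrier at 0 has this form:

\[
M = \begin{pmatrix}
s & r & 0 & 0 & \cdots & 0 & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & q & 0 & p \\
0 & 0 & 0 & 0 & \cdots & 0 & 1 & 0
\end{pmatrix}
\]

The first row is the elastic barrier and the last row is the
reflecting barrier.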



A problem of obvious relevance to any gambler is the 
probability, for a given initial fortune, that the random 
walk will reach state 0 before reaching state c. If
the gambler's fortune ever reaches state zero, we say the 
gambler is "ruined". For this reason this problem has come 
to be called the ruin problem . This is only the beginning 
of the general question of how Markov chains behave in the 
long run, which we will consider later in this chapter. 

Let A be the event "in the random walk with absorbing
barriers, the walk reaches 0 before reaching c". Then
the ruin problem is to compute \(u_j = P(A \mid X_0 = j)\), for all
j. Now \(u_0\) is 1 because \(X_0 = 0\) means we are ruined
from the start; and \(u_c\) is 0 for the opposite reason.
For \(j \neq 0, c\) we use the conditional law of alternatives
(see section V.1), conditioning on the possible values of
\(X_1\). There are only two alternatives, \((X_1 = j-1)\) or
\((X_1 = j+1)\), when \((X_0 = j)\). Therefore,

\[
\begin{aligned}
P(A \mid X_0 = j) &= P(A \mid (X_0 = j) \cap (X_1 = j-1))\, P(X_1 = j-1 \mid X_0 = j) \\
&\quad + P(A \mid (X_0 = j) \cap (X_1 = j+1))\, P(X_1 = j+1 \mid X_0 = j) \\
&= P(A \mid X_1 = j-1)\, q + P(A \mid X_1 = j+1)\, p,
\end{aligned}
\]
by the Markov property and the definition of the transition
probabilities. We now make the important observation that
in a homogeneous Markov chain we may view any of the random
variables \(X_n\) as the initial distribution of the sequence
\(X_n, X_{n+1}, X_{n+2}, \ldots\), which is itself a Markov chain having
the same transition matrix as the original Markov chain
\(X_0, X_1, X_2, \ldots\). In other words, except for the numbering
of the random variables and the initial distribution, this
new Markov chain is the same as the old Markov chain. Since
A is the event that the gambler is eventually ruined, it
does not depend on the numbering of the random variables
\(X_0, X_1, X_2, \ldots\). That is, we don't care when the gambler
is ruined. Hence

\[
P(A \mid X_1 = j-1) = u_{j-1} \quad \text{and} \quad P(A \mid X_1 = j+1) = u_{j+1}.
\]

Therefore, \(u_j = P(A \mid X_0 = j) = u_{j-1}\, q + u_{j+1}\, p\), for 0 < j < c.

An equation of the above form is called a difference
equation, while the conditions \(u_0 = 1\) and \(u_c = 0\) are
its boundary conditions. A difference equation can be
solved in a manner exactly analogous to a differential
equation, except that instead of exponential functions
\(u(x) = e^{\alpha x}\) we use the functions \(u_j = \alpha^j\), where
\(\alpha\) is a constant. We'll proceed by steps to emphasize the
similarity with differential equation techniques.

Step 1. Determine the possible values for \(\alpha\).

If we substitute \(u_j = \alpha^j\) in the equation
\(u_j = u_{j-1}\, q + u_{j+1}\, p\), we get
\(\alpha^j = \alpha^{j-1} q + \alpha^{j+1} p\). Dividing
by \(\alpha^{j-1}\), we find that \(\alpha = q + \alpha^2 p\), a quadratic
equation in \(\alpha\). Solving for \(\alpha\) we find that

\[
\alpha = \frac{1 \pm \sqrt{1 - 4pq}}{2p}
       = \frac{1 \pm \sqrt{1 - 4p + 4p^2}}{2p}
       = \frac{1 \pm (1 - 2p)}{2p}
       = \left\{ \frac{q}{p},\ 1 \right\}.
\]

Notice that there are two cases. When p = q = 1/2, there
is a double root \(\alpha = 1\); and when \(p \neq q\), there are two
distinct roots.

Step 2. Find the general solution to the difference equation.

When there are distinct roots, the general
solution is just an arbitrary linear combination of the
functions \(\alpha^j\) as \(\alpha\) ranges over all roots. Thus when
\(p \neq q\), the general solution is

\[
u_j = C_1 (q/p)^j + C_2 (1)^j = C_1 (q/p)^j + C_2.
\]

On the other hand, if there are multiple roots we must
use functions of the form \(\alpha^j, j\alpha^j, j^2\alpha^j, \ldots\), using as
many as the multiplicity of \(\alpha\) as a root. Therefore the
general solution when p = q = 1/2 is

\[
u_j = C_1 (1)^j + C_2\, j\, (1)^j = C_1 + C_2\, j.
\]

Step 3. Use the boundary conditions to find the particular
solution.

The boundary conditions are \(u_0 = 1\) and \(u_c = 0\). So
when \(p \neq q\) we have:

\[
\begin{aligned}
u_0 &= 1 = C_1 (q/p)^0 + C_2 = C_1 + C_2 \\
u_c &= 0 = C_1 (q/p)^c + C_2.
\end{aligned}
\]

Solving for \(C_1\) and \(C_2\) we find that

\[
C_1 = \frac{1}{1 - (q/p)^c}, \qquad C_2 = \frac{-(q/p)^c}{1 - (q/p)^c}.
\]

Hence the particular solution we seek is

\[
u_j = \frac{(q/p)^j - (q/p)^c}{1 - (q/p)^c}.
\]
On the other hand, when p = q = 1/2, we have

\[
\begin{aligned}
u_0 &= 1 = C_1 + C_2 \cdot 0 = C_1 \\
u_c &= 0 = C_1 + C_2 \cdot c.
\end{aligned}
\]

Solving for \(C_1\) and \(C_2\) we find that

\[
C_1 = 1, \qquad C_2 = -1/c.
\]

Hence the solution in this case is

\[
u_j = 1 - j/c.
\]



Summarizing, we find that the probability of ruin starting
from the initial fortune j is

\[
P(A \mid X_0 = j) =
\begin{cases}
1 - j/c, & \text{if } p = q = 1/2 \text{ (the game is fair)} \\[4pt]
\dfrac{(q/p)^j - (q/p)^c}{1 - (q/p)^c}, & \text{if } p \neq q \text{ (the game is unfair)}
\end{cases}
\]

The Solution to the Ruin Problem
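
The solution can be checked against the difference equation and its boundary conditions. The sketch below is illustrative only: the function name `ruin_probability` and the values c = 10, p = 18/38 are assumptions chosen for the check. Exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

def ruin_probability(j, c, p):
    """Probability of reaching 0 before c, starting from fortune j."""
    q = 1 - p
    if p == q:                          # fair game
        return 1 - Fraction(j, c)
    ratio = q / p                       # unfair game
    return (ratio**j - ratio**c) / (1 - ratio**c)

c, p = 10, Fraction(18, 38)             # illustrative values
q = 1 - p
u = [ruin_probability(j, c, p) for j in range(c + 1)]

# Boundary conditions u_0 = 1, u_c = 0 and the difference equation
# u_j = u_{j-1} q + u_{j+1} p for 0 < j < c.
boundary_ok = (u[0] == 1 and u[c] == 0)
equation_ok = all(u[j] == u[j - 1] * q + u[j + 1] * p for j in range(1, c))
```

Both checks succeed because \((q/p)^j\) and the constant function each satisfy the difference equation, and the constants \(C_1\), \(C_2\) were chosen to match the boundary values.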



The so-called gambler's ruin paradox refers to the fact 
that the above probabilities are very close to 1 when 
perfectly reasonable values of p, q, j and c are used. 
For example, suppose that a gambler has an initial fortune 
of $500. Suppose that the gambler decides to be smart and 
will quit the moment his fortune reaches $1000. He is play- 
ing $1 bets on black or red in the game of roulette. In 
this game p = 18/38 and q = 20/38. He reasons that although 
the game is unfair, the odds against his eventual win are 
only 10:9. This would be true if he bet his entire $500 on 
one turn of the wheel. However, by betting only $1 at a 
time his probability of ruin is, by the above formula,

\[
\begin{aligned}
P(A \mid X_0 = 500) &= \frac{(20/18)^{500} - (20/18)^{1000}}{1 - (20/18)^{1000}} \\[4pt]
&= 1 - \frac{(20/18)^{500} - 1}{(20/18)^{1000} - 1} \\[4pt]
&\approx 1 - (10/9)^{-500} \\
&> 1 - 10^{-22}.
\end{aligned}
\]

Therefore, the gambler has less than one chance in \(10^{22}\) of
eventually winning!
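
The arithmetic of the roulette example can be repeated exactly. This is a sketch under the text's own values (p = 18/38, q = 20/38, initial fortune 500, goal 1000); the variable names are assumptions.

```python
from fractions import Fraction

p, q = Fraction(18, 38), Fraction(20, 38)
j, c = 500, 1000
ratio = q / p                                    # equals 10/9

# Exact rational value of the ruin probability from the formula above.
ruin = (ratio**j - ratio**c) / (1 - ratio**c)
win = 1 - ruin                                   # chance of reaching $1000

print(float(win))                                # on the order of 1e-23
```

Simplifying the formula shows that the winning probability is exactly \(1/((10/9)^{500} + 1)\), which is why \((10/9)^{-500}\) is such a good approximation to it.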

On the other hand, this says nothing about how long it 
will take for the gambler to be ruined nor whether the 
gambler will enjoy occasional "winning streaks". One can 
clearly see that it will take many more than 500 turns of 
the wheel on the average before the gambler is ruined. 
Moreover, one can show that "winning streaks" and "losing 
streaks" (when suitably defined) are actually probable 
events during long betting sessions. So the "structure" of 
the gambler's ruin is much more complicated than the solution 
to the ruin problem suggests. It is this complexity that the 
gambler is presumably paying for when he bets smaller bets 
instead of the one grand $500 bet on a single turn of the
wheel.

We end by considering what happens when \(c \to \infty\). One
can think of this as a random walk with just one absorbing
barrier. It corresponds to the gambling situation in which
the house has infinite resources, and the gambler sets no
limit on how much he is willing to win. There are three
cases.

Unfair game to the gambler: p < q. In this case,

\[
u_j = \frac{(q/p)^j - (q/p)^c}{1 - (q/p)^c}
    = \frac{(p/q)^{c-j} - 1}{(p/q)^c - 1}
    \to \frac{0 - 1}{0 - 1} = 1
\]

as \(c \to \infty\), because p/q < 1. So the gambler certainly
loses in this case. This is no surprise.

Fair game: p = q = 1/2. In this case,

\[
u_j = 1 - j/c \to 1 \quad \text{as } c \to \infty.
\]

Therefore, the gambler eventually loses even in a fair game.

Unfair game to the house: p > q. In this case q/p < 1 so

\[
u_j = \frac{(q/p)^j - (q/p)^c}{1 - (q/p)^c} \to (q/p)^j.
\]

Hence there is a positive probability, \(1 - (q/p)^j\), that the
gambler is never ruined and continues winning forever. This
follows essentially from the fact that p > q produces a "drift"
of the random walk to the right as if there were a force acting
in the positive direction.
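
The three limiting cases can be observed numerically by taking c large. In this sketch the function name, the starting fortune j = 5 and the probabilities 0.4, 0.5, 0.6 are illustrative assumptions. When p < q the formula is evaluated in the rewritten form with p/q, so that no large power has to be computed.

```python
def ruin_probability(j, c, p):
    """Ruin probability u_j for barriers at 0 and c (floating point)."""
    q = 1.0 - p
    if p == q:
        return 1.0 - j / c
    if p < q:
        s = p / q                       # s < 1, so its powers stay bounded
        return (s**(c - j) - 1.0) / (s**c - 1.0)
    s = q / p                           # here s < 1 as well
    return (s**j - s**c) / (1.0 - s**c)

j = 5
for c in (10, 100, 1000):
    print(c,
          ruin_probability(j, c, 0.4),  # tends to 1
          ruin_probability(j, c, 0.5),  # equals 1 - j/c, tends to 1
          ruin_probability(j, c, 0.6))  # tends to (q/p)^j
```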

3. The Graph of a Markov Chain

The graph of a homogeneous Markov chain is an effective 
method of describing and picturing a Markov chain. Moreover, 
by using these graphs we can view all homogeneous Markov 
chains as being "random walks" but on the graph rather than 
on a straight line. 

Let us begin with a simple example. This is a simple 
model of machine operation. We suppose that there are two 
states 1 = "the machine runs" and 2 = "the machine is 
broken down". During each unit of time (say every hour), 
the machine either works or doesn't work. There is a certain 
probability \(p_{11}\) that a working machine will stay working
and a probability \(p_{22}\) that a broken machine will still be
broken. If we assume that these apply to the machine during
each unit of time independently of previous states, then this
model is a homogeneous Markov chain. Its transition matrix
is

\[
\begin{pmatrix}
p_{11} & p_{12} = 1 - p_{11} \\
p_{21} = 1 - p_{22} & p_{22}
\end{pmatrix}
\]

To picture this Markov chain we draw two points 
(vertices) to represent the states. We then draw lines 
(edges) between these vertices, with arrowheads to denote 
direction, indicating the possibility of passage from one
state to another (or the same) state.

We now think of the model as describing the motion of a
point along the edges of this graph. During each unit 
of time, the point follows exactly one of the edges in 
the indicated direction to its other end. The label of 
the edge denotes the probability that that edge will be 
chosen. We may think of the point as representing the 
position of someone "walking" on the graph in which case 
our model represents a "random walk on the graph." 

As graphs, the various random walks we considered in 
the last section look like the following: 



[Figure: Random walk (no barriers). States ..., -2, -1, 0, 1, 2, ...,
with the edges to the right labelled p.]

[Figure: Random walk (absorbing barriers). States 0, 1, 2, ..., c-2, c-1, c.]

[Figure: Random walk (reflecting barriers).]

The only features that change in the above models are the
boundaries. There are three kinds of boundaries:

[Figure: absorbing boundary, reflecting boundary, elastic boundary.]

Definition The graph of a homogeneous Markov chain
consists of

(a) one vertex for every state,

(b) for every pair of states i and j such that \(p_{ij} \neq 0\),
    a directed edge from state i to state j.

Notice that we do not have an edge from state i to state
j if \(p_{ij} = 0\).
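
The definition translates directly into a procedure that lists the edges of the graph from a transition matrix. The sketch below is only illustrative: the function name `markov_graph` is an assumption, the states are numbered from 0 rather than 1, and the machine-model values \(p_{11} = 0.9\), \(p_{22} = 0.5\) are chosen for the example.

```python
def markov_graph(M):
    """Return the labelled directed edges (i, j, p_ij) with p_ij != 0."""
    return [(i, j, pij)
            for i, row in enumerate(M)
            for j, pij in enumerate(row)
            if pij != 0]

# Machine model: state 0 = "the machine runs", state 1 = "broken down".
M = [[0.9, 0.1],
     [0.5, 0.5]]
edges = markov_graph(M)
```

Each triple gives the tail of an edge, its head, and the label attached to it; pairs of states with \(p_{ij} = 0\) produce no edge.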

The Ehrenfest Diffusion Model

The Ehrenfest model attempts to explain the following 
physical experiment. A container is divided into two equal 
parts by a removable wall. We place k gas molecules in 
one part and r - k in the other. Then we remove the
wall and wait for a time. If we now reinsert the wall, we
will find almost the same number of particles in each part
no matter how many particles were initially placed in the
two parts. The problem of finding an explanation for this
phenomenon is called the diffusion problem.

[Figure: a container divided by a removable wall, with k molecules
on one side and r-k molecules on the other.]

This model was one of the earliest successful attempts 
to explain the phenomenon of diffusion using probability. 
Much more sophisticated models now exist, but it is best to 
start with the simplest model. For this model we imagine 
that we have two urns or containers filled with r balls or 
particles, k in urn 1 and r-k in urn 2. The state of
the model is the number of particles in urn 1. A transition
of the model consists of transferring one particle from one
urn to the other urn. Therefore, there are two possible
transitions from state k: to state k-1 or to state k+1. The
transition probabilities are assigned in such a way that
every particle has the same probability of being transferred
to the other urn as any other particle. Therefore the
transition probabilities are:



\[
p_{k,k-1} = k/r, \qquad p_{k,k+1} = (r-k)/r,
\]

or, using graphs, the transitions from k look like this:

[Figure: state k with an edge labelled k/r to state k-1 and an edge
labelled (r-k)/r to state k+1.]

The entire graph of this Markov chain is:

[Figure: The graph of the Ehrenfest Diffusion Model, with states
0, 1, ..., k-1, k, k+1, ..., r-1, r.]

In other words, this Markov chain is a random walk with
reflecting barriers but with a "central force" tending to
keep the state near r/2. We will consider in section 5
what it means to say that the state of this model
"tends" to be near r/2.



Balls into Boxes 

Problems of placing balls into boxes can often be 
stated in terms of Markov chains. For example, suppose we 
are sequentially placing balls into n boxes and that we 
want to know how fast the boxes are being "filled". The 
state of this Markov chain is the number of boxes having
at least one ball. The transitions are either from a state
k to the same state if the next ball goes into an already
occupied box or to the state k+1 if the next ball occupies
a new box. Since each box is equally likely to contain the
next ball, the probabilities for these two cases are k/n
and 1-k/n respectively.

[Figure: The transitions from state k in the Balls into Boxes
Markov chain: a loop at k labelled k/n and an edge from k to k+1
labelled (n-k)/n.]

The graph of this Markov chain looks like this:

[Figure: The graph of the Balls into Boxes Markov Chain, with
states 0, 1, 2, ..., n-1, n.]
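
To see how fast the boxes are being "filled", one can push the initial distribution through the chain one ball at a time. The following is a minimal sketch; the function name `occupied_distribution` and the small values of n and m are assumptions for illustration.

```python
from fractions import Fraction

def occupied_distribution(n, m):
    """Distribution of the number of occupied boxes after m balls, n boxes."""
    dist = [Fraction(0)] * (n + 1)
    dist[0] = Fraction(1)                    # no ball has been placed yet
    for _ in range(m):
        new = [Fraction(0)] * (n + 1)
        for k, prob in enumerate(dist):
            if prob == 0:
                continue
            new[k] += prob * Fraction(k, n)              # ball hits an occupied box
            if k < n:
                new[k + 1] += prob * Fraction(n - k, n)  # ball occupies a new box
        dist = new
    return dist

dist = occupied_distribution(3, 2)   # 3 boxes, 2 balls
```

With 3 boxes and 2 balls the second ball lands in the first ball's box with probability 1/3, so the chain is in state 1 with probability 1/3 and in state 2 with probability 2/3.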

A Genetics Model 

The laws of genetics in biology are intrinsically 
probabilistic. We will consider a very simplified model 
but one which exhibits the basic ideas. We imagine that a 
relatively small population of females is introduced to a 
large ambient population. Consider a single gene having 
two alleles: a dominant allele A and a recessive allele 
a. Suppose that the distribution of the three possible 
genotypes is [p, q, r] in the ambient population, i.e. 
the fraction of the ambient population having genotype AA 
is p, having Aa is q and having aa is r. If the 
females mate with the males randomly (at least with respect to
this gene) , then the distribution of the genotypes in 
successive generations of females in the subpopulation will 
form a Markov chain. 

The states of the Markov chain are the three genotypes 
and the transitions consist of the change of state from a 
mother to her daughter. The probabilities for parents having
given genotypes are the well-known Mendelian laws:

       parents                 children
   mother    father        AA      Aa      aa

     AA        AA           1       0       0
     AA        Aa          1/2     1/2      0
     AA        aa           0       1       0
     Aa        AA          1/2     1/2      0
     Aa        Aa          1/4     1/2     1/4
     Aa        aa           0      1/2     1/2
     aa        AA           0       1       0
     aa        Aa           0      1/2     1/2
     aa        aa           0       0       1

The Mendelian probabilities for a pair of alleles of one gene



Since we know the distribution of the genotypes in the 
ambient population, and since we have assumed the females 
mate randomly with respect to this gene, we can compute the
probabilities for each given female genotype to give rise
to a given daughter genotype:

                          daughter
   mother        AA            Aa            aa

     AA        p + q/2       q/2 + r          0
     Aa       p/2 + q/4        1/2        q/4 + r/2
     aa           0          p + q/2       q/2 + r

This is the transition matrix of the Markov chain. The
graph of the Markov chain is:

[Figure: the graph on the three states AA, Aa, aa, with edges
labelled by the transition probabilities above.]
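
The mother-to-daughter transition matrix can be recomputed from the Mendelian table by averaging over the father's genotype. This is a sketch only: the names `MENDEL` and `daughter_matrix` and the ambient distribution [1/2, 1/3, 1/6] are assumptions made for the example.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)
GENOTYPES = ["AA", "Aa", "aa"]

# MENDEL[(mother, father)] = probabilities that the child is AA, Aa, aa.
MENDEL = {
    ("AA", "AA"): (1, 0, 0),
    ("AA", "Aa"): (half, half, 0),
    ("AA", "aa"): (0, 1, 0),
    ("Aa", "AA"): (half, half, 0),
    ("Aa", "Aa"): (quarter, half, quarter),
    ("Aa", "aa"): (0, half, half),
    ("aa", "AA"): (0, 1, 0),
    ("aa", "Aa"): (0, half, half),
    ("aa", "aa"): (0, 0, 1),
}

def daughter_matrix(p, q, r):
    """Rows: mother's genotype; columns: daughter's genotype."""
    father = dict(zip(GENOTYPES, (p, q, r)))    # ambient male distribution
    return [[sum(father[f] * MENDEL[(m, f)][d] for f in GENOTYPES)
             for d in range(3)]
            for m in GENOTYPES]

p, q, r = Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)   # illustrative
M = daughter_matrix(p, q, r)
```

The middle entry of the Aa row comes out as 1/2 for every ambient distribution, because p/2 + q/2 + r/2 = 1/2 whenever p + q + r = 1.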

We will leave it as an exercise to alter this model 
to include "preferences" of females of a given genotype 
for males of another (or the same) genotype as well as to 
include "survival probabilities" for each of the genotypes 
of the daughters. One can also construct a model that 
includes the variation of the distribution of the genotypes 
of both sexes. The resulting model is a pair of interacting 
Markov chains acting simultaneously. 

Unlike our other examples of Markov chains, we have not 
considered this Markov chain as a "random walk." We could 
do so by considering only one "line" of females: a mother,
her oldest daughter, that daughter's oldest daughter, etc. But as long
as the number of children born by a female is independent of 
her genotype, it is more reasonable to regard this Markov 
chain as the sequence of distributions of the successive 
female generations. 

More generally, we can view any Markov chain not as a
random walk by a single particle but as a random walk by a
very large population of particles all simultaneously
"walking" on the graph. It is this point of view that is 
best when we consider the long term behavior of a Markov 
chain. The behavior of a single particle along its walk 
can be quite intricate. But the general behavior of the 
whole population of points is very predictable and stable. 

4. The Markov Sample Space

We have seen three definitions of a Markov chain so far. 
We first defined it to be a sequence \(X_0, X_1, X_2, \ldots\) of
random variables satisfying the Markov property. We then 
saw that it is equivalent to specify a stochastic matrix and 
a stochastic vector. Finally we saw that we can visualize 
Markov chains as random walks on graphs. But a Markov 
chain is a stochastic process so we must have a sample space 
and a probability. 

The sample space \(\Omega\) of a Markov chain consists of all
possible infinite paths along edges of its graph. By
"infinite" we mean that the path has a starting point but no
ending point. In other words, the sample space is the set
of all possible sequences \((i_0, i_1, i_2, \ldots)\) of states. This
formulation is formally analogous to the definition of the 
Bernoulli process (coin tossing) which is a special case. 

To define the probability on \(\Omega\) we need the transition
matrix M and the stochastic vector \(u_0\) representing the
initial distribution. It is standard notation to write
\(p_{ij}^{(n)}\) for the entries of the matrix \(M^n\); \(p_{ij}^{(n)}\) is the
probability for the Markov chain to be in state j given that
it was in state i exactly n units of time previously.
The components of \(u_0\) are usually denoted by \(a_i\). The
elementary events of the sample space \(\Omega\) are the subsets
\((X_n = i)\), i.e. the set of all paths whose n-th
vertex corresponds to state i. The probability of an
elementary event is given by

\[
P(X_n = i) = \sum_j a_j\, p_{ji}^{(n)}.
\]
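
The probability of an elementary event is the i-th component of the row vector \(u_0 M^n\), which can be computed by multiplying by M repeatedly. In this sketch the function names and the two-state chain are illustrative assumptions, and states are numbered from 0.

```python
from fractions import Fraction

def vec_mat(u, M):
    """Row vector u times matrix M."""
    return [sum(u[i] * M[i][j] for i in range(len(M)))
            for j in range(len(M[0]))]

def distribution_at(a, M, n):
    """Distribution of X_n: component i is the sum over j of a_j p_ji^(n)."""
    u = list(a)
    for _ in range(n):
        u = vec_mat(u, M)
    return u

# Illustrative two-state chain; these numbers are assumptions.
a = [Fraction(1, 4), Fraction(3, 4)]
M = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
u1 = distribution_at(a, M, 1)
```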

So far this process is defined analogously to the Bernoulli
process. However, we do not postulate that elementary
events are independent. Instead we specify their
probabilities to be, for \(n_1 < n_2 < \cdots < n_k\),

\[
P((X_{n_1} = i_1) \cap (X_{n_2} = i_2) \cap \cdots \cap (X_{n_k} = i_k))
  = \sum_j a_j\, p_{j i_1}^{(n_1)}\, p_{i_1 i_2}^{(n_2 - n_1)} \cdots p_{i_{k-1} i_k}^{(n_k - n_{k-1})}.
\]

In particular, we define

\[
P((X_0 = i_0) \cap (X_1 = i_1) \cap \cdots \cap (X_n = i_n))
  = a_{i_0}\, p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.
\]

This last expression determines the preceding ones.



Definition Given a stochastic matrix \(M = (p_{ij})\) and a
stochastic vector \(u_0 = (a_i)\), the Markov chain whose
transition probability matrix is M and whose initial distribution
is \(u_0\) is defined by

(1) the sample space \(\Omega\) is the set of all possible
    sequences of states \((i_0, i_1, i_2, \ldots)\);

(2) the elementary events are the subsets \((X_n = i)\) of all
    sequences whose n-th entry is i;

(3) the probability is defined by
\[
P((X_0 = i_0) \cap (X_1 = i_1) \cap \cdots \cap (X_n = i_n))
  = a_{i_0}\, p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.
\]
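
Part (3) of the definition is a product along the path, which the following sketch computes. The function name `path_probability` and the two-state chain are assumptions for illustration; states are numbered from 0.

```python
from fractions import Fraction

def path_probability(a, M, path):
    """P(X_0 = i_0, X_1 = i_1, ..., X_n = i_n) for path = [i_0, ..., i_n]."""
    prob = a[path[0]]                     # initial probability a_{i_0}
    for i, j in zip(path, path[1:]):
        prob *= M[i][j]                   # one factor p_{ij} per transition
    return prob

# Illustrative two-state chain; these numbers are assumptions.
a = [Fraction(1, 4), Fraction(3, 4)]
M = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
prob = path_probability(a, M, [0, 0, 1])  # (1/4)(9/10)(1/10)
```

Summing this quantity over all paths of a fixed length gives 1, since the initial vector and every row of M add to 1.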

Connectivity 

We now classify Markov chains with respect to various 
properties relevant to their long term behavior. The most 
obvious property is connectivity. If we draw the graph of 
the Markov chain, it should be clear what we mean when we
say the graph is disconnected: it is made up of two or more
parts having no edges between the parts. If the graph is
not disconnected, we say the Markov chain is connected.
Clearly each connected part of a disconnected Markov chain
acts like a Markov chain by itself independent of the rest
of the Markov chain. Because of this we will always assume
our Markov chains are connected.

[Figure: A Markov chain having two connected parts, on the
states 1, 2, 3, 4.]



Persistence and Transience

The next property we consider is recurrence. Having
once occurred, a state either must occur again or may fail
to do so. More precisely, let \(A_i\) be the event "state i
eventually occurs at some time after 0." Then either
\(P(A_i \mid X_0 = i) = 1\) or \(P(A_i \mid X_0 = i) < 1\). We call those
two possibilities persistence and transience.



Definition A state i is persistent if the probability of
returning to state i (after it has occurred at least 
once) is 1. A state i is transient if the probability 
is positive for the state i never to occur again. 



Now, having once occurred, a persistent state must necessarily 
occur infinitely many times: each time it occurs, we repeat 
the argument that it must occur once more, so it can never 
"stop" occurring. 

Consider the various random walks we have seen so far. 
In the random walk with absorbing barriers, the barriers 
are obviously persistent. Notice that in this case only one
of the persistent states can occur during any one "walk".
Although persistent states occur infinitely often if they 
ever occur at all, it is quite possible in a connected 
Markov chain for a persistent state never to occur. The 
interior states in the random walk with absorbing barriers 
are all transient because we know with probability 1 that 
either one or the other barrier will be encountered even- 
tually. In the random walk with one absorbing barrier and 
one reflecting (or elastic) barrier, there is just one 
persistent state. On the other hand, if both barriers are 
reflecting or elastic, all the states become persistent. 
We leave it as an exercise to prove these last two statements 
using the solution to the ruin problem. 

Finiteness 

With respect to persistence and transience, there is a 
striking difference between finite and infinite Markov 
chains. In a finite Markov chain some state must be persis- 
tent. But for infinite Markov chains it is quite possible 
for every state to be transient. Consider the ordinary
barrierless random walk. Suppose that the random walk is not
symmetric, say p > q. Recall that in our solution to the
ruin problem, we noted that there is a positive probability,
\(1 - (q/p)^j\), such that, starting in state j, the random walk
forever drifts to the right and never encounters state 0.
Now, if we start in state 0, the next state is state 1
with probability p, and from here the probability is
1 - q/p for 0 never to occur again. Therefore
with probability p(1 - q/p) = p - q > 0, state 0 never
occurs again. By the same argument all the states of a
nonsymmetric barrierless random walk are transient.

[Figure: states -4, ..., 5 on a line; the probability of going
from state 0 to state 1 is p, and the probability of never going
back to state 0 from state 1 is 1 - q/p. Caption: The probability
of never returning to state 0 is at least p-q.]

In the long run, a nonsymmetric random walk
drifts forever either to the right if p > q or to the left
if p < q.

On the other hand, for the symmetric random walk every 
state is persistent. In our solution to the ruin problem, 
we noted that starting in state j > 0, state 0 eventually
occurs with probability 1, and this generalizes to any two 
states i and j. So not only is every state persistent, 
but also every state occurs infinitely often. 

Periodicity

The last property we will consider is periodicity. An
example of a periodic Markov chain is the following:

[Figure: a Markov chain whose four states 1, 2, 3, 4 are joined
in a cycle.]

We will give a precise definition later. Such a Markov 
chain will "cycle" endlessly with period 4 in a merry-go-round 
fashion. These are not very interesting Markov chains with 
respect to long term behavior since they all essentially 
look more or less like this one. More precisely, one can
prove that one can divide the states into classes
\(G_1, G_2, \ldots, G_t\) in such a way that the graph of the Markov
chain looks like this:

[Figure: the classes arranged in a cycle.]

The only edges in the graph are from states in \(G_i\) to
states in \(G_{i+1}\) or from states in \(G_t\) to states in \(G_1\).

And moreover, the long term behavior of each piece \(G_i\) is
just as if it were a Markov chain by itself. For this reason
we will assume that our Markov chains are not periodic.



Ergodicity 

Having gone through all the above preliminaries we 
find that the most interesting Markov chains with respect to 
their long term behavior are finite, connected and nonperiodic.
We will assume in addition that every state is persistent.
The reason for this is that in a finite Markov chain no
transient state can occur more than finitely many times.
Hence all transient states eventually cease to be relevant.



Definition A homogeneous Markov chain is said to be 
ergodic if it is finite, connected and nonperiodic, and if 
all its states are persistent. 



For the rest of this chapter we will study only 
ergodic Markov chains. 



5. Steady States of Ergodic Markov Chains

The most surprising fact about ergodic Markov chains
is that the long term behavior of such a Markov chain is
independent of the initial distribution \(X_0\). That is, no
matter what the initial distribution \(X_0\), the distributions
of the \(X_n\)'s as \(n \to \infty\) will tend toward one particular
distribution which we call the steady state or invariant
distribution.

The best way to view this distribution is to take the
point of view mentioned at the end of section 3.
Instead of thinking of the Markov chain as the motion of a
single particle along the graph, we think of an entire
population of particles as simultaneously "walking" along
the graph. The steady state distribution has the property 
that although individual particles are in constant motion, 
the population as a whole has a fixed distribution. So if 
we choose not to distinguish one particle from another, we 
would perceive no change as the Markov chain proceeds in 
time. 

Definition For any Markov chain \(X_0, X_1, X_2, \ldots\), a
probability distribution for \(X_0\) such that all the \(X_n\)'s
are equidistributed is called a steady state or invariant
distribution of the Markov chain.

To find a steady state distribution we make use of the
terminology of vectors and matrices that we introduced in
section 1. If we write \(u_0\) for the vector corresponding
to the initial distribution and if M is the transition
matrix, then

\[
u_1 = u_0 M
\]

is the vector corresponding to the distribution of \(X_1\). Now
if \(u_1 = u_0\), then clearly all subsequent distributions will
be the same as the first two. Hence a steady state
distribution corresponds to an eigenvector whose eigenvalue is 1.
Finding all such eigenvectors for a given matrix M is a
simple exercise in linear algebra (simultaneous equations).
Having found all such eigenvectors, the steady state
distributions are those whose components are between 0 and 1
and add to 1.

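Consider for example the machine operation model

[Figure: the graph of the machine model, with states 1 and 2,
loops labelled \(p_{11}\) and \(p_{22}\), and edges labelled \(p_{12}\)
and \(p_{21}\).]

whose transition matrix is

\[
M = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix},
\]

where \(p_{12} = 1 - p_{11}\) and \(p_{21} = 1 - p_{22}\). To find a steady
state distribution we must solve

\[
[x_1, x_2] = [x_1, x_2] \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}
\]

or

\[
\begin{aligned}
x_1 &= x_1 p_{11} + x_2 p_{21} \\
x_2 &= x_1 p_{12} + x_2 p_{22}
\end{aligned}
\]

for \(x_1\) and \(x_2\). This system of equations reduces to the
single equation

\[
x_2 = x_1\, \frac{p_{12}}{p_{21}}.
\]

Therefore the general eigenvector belonging to the eigenvalue
1 is

\[
C\, [\,1,\ p_{12}/p_{21}\,].
\]

This will be a stochastic vector provided

\[
C + C\, p_{12}/p_{21} = 1.
\]

Solving for C, we find that the invariant distribution of
this Markov chain is

\[
\left[ \frac{p_{21}}{p_{12} + p_{21}},\ \frac{p_{12}}{p_{12} + p_{21}} \right]
\]

or in terms of \(X_n\),

\[
P(X_n = 1) = \frac{p_{21}}{p_{12} + p_{21}}, \qquad
P(X_n = 2) = \frac{p_{12}}{p_{12} + p_{21}}.
\]

For example, if the machine breaks down with probability
1/10 every hour and if a broken machine will go back into
service with probability 1/2 every hour, then the machine
will be running

\[
\frac{p_{21}}{p_{12} + p_{21}} = \frac{.5}{.1 + .5} = \frac{.5}{.6} = \frac{5}{6}
\]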

of the time. Moreover, this will be true in the long run 
whether the machine is initially running or initially 
broken down. 
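
The machine example can be checked directly: the vector [5/6, 1/6] is left unchanged by the transition matrix, and a machine that starts out broken approaches the same distribution. The function names in this sketch are assumptions; the numbers 1/10 and 1/2 are those of the example.

```python
from fractions import Fraction

def two_state_steady_state(p12, p21):
    """Invariant distribution [p21/(p12+p21), p12/(p12+p21)]."""
    total = p12 + p21
    return [p21 / total, p12 / total]

def step(u, M):
    """One unit of time: the row vector u times the matrix M."""
    n = len(M)
    return [sum(u[i] * M[i][j] for i in range(n)) for j in range(n)]

p12, p21 = Fraction(1, 10), Fraction(1, 2)
M = [[1 - p12, p12],
     [p21, 1 - p21]]
steady = two_state_steady_state(p12, p21)    # [5/6, 1/6]

u = [Fraction(0), Fraction(1)]               # machine initially broken down
for _ in range(50):
    u = step(u, M)                           # u approaches the steady state
```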

Waiting Times and the Recurrence Theorem

In all the stochastic processes we have studied so far,
the waiting times have played a crucial role. So it is also
with Markov chains. For each state j, we define the random
variable \(T_j\) to be the waiting time for return to state j,
given that one starts in state j. More generally we have
the waiting times \(T_{ij}\) for the occurrence of state j
starting from state i. In terms of Markov events,

\[
(T_{ij} = n) = (X_0 = i) \cap (X_1 \neq j) \cap \cdots \cap (X_{n-1} \neq j) \cap (X_n = j).
\]

Of course, we have that \(T_j = T_{jj}\). For Markov chains in
general these are not really random variables because one
may have states for which \(\sum_n P(T_{ij} = n) \neq 1\) (in fact one can
have \(P(T_{ij} = n) = 0\) for all n). We call such an object
a defective random variable. However, we specifically chose
to restrict attention to ergodic Markov chains because in
this case all the waiting times are ordinary random
variables.

There is a slight technicality that we ought to
mention briefly. The waiting times \(T_{ij}\) are not defined
on the same sample space. In fact \(T_{ij}\) is defined on
the Markov chain for which \(X_0\) takes initial value i with
probability 1.

The standard notation for \(P(T_j = n)\) is \(f_j^{(n)}\) and for
\(P(T_{ij} = n)\) is \(f_{ij}^{(n)}\). The random variable \(T_j\) is also
called the recurrence time of state j, and its expectation
\(E(T_j) = \sum_n n f_j^{(n)}\) is called the mean recurrence time of
state j. We can now give a precise definition of what it means
for a Markov chain to be ergodic.

Definition A finite, connected Markov chain is ergodic if

(1) for every state j, \(\sum_n P(T_j = n) = 1\) (every state
    is persistent),

(2) for every state j, \(p_{jj}^{(n)} > 0\) except for at
    most a finite number of times n (no state
    is periodic).

The most important facts about waiting times are 
contained in the following remarkable 

Recurrence Theorem For any finite, ergodic Markov chain,

(1) $p_{ij}^{(n)} \to 1/E(T_j)$ as $n \to \infty$ for all $i$ and $j$,

(2) the components of the steady state distribution
are $1/E(T_j)$.

One can rewrite statement (1) in the form

$$\lim_{n \to \infty} P(X_n = j) = 1/E(T_j).$$

Combining this with statement (2), we find that ergodic
Markov chains satisfy an analogue of the Central Limit
Theorem:

No matter what the initial distribution is, the
distributions of the random variables $X_n$ of an ergodic Markov
chain necessarily converge to the steady state distribution.



Furthermore, the steady state distribution may be regarded
as specifying the average fraction of time the random walk
spends in the various states. We then have that the average
length of time between occurrences of state $i$ is the inverse
of the average fraction of time spent in state $i$. While this is an
intuitively clear result, it is far from being easy to prove.
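
The two statements of the Recurrence Theorem can be checked numerically on a small example. The Python sketch below is not taken from the text: the three-state matrix `P` and the helper names are assumptions chosen for illustration, and the mean recurrence time is only approximated, by truncating the series $\sum_n n f_j^{(n)}$ after `n_max` terms.

```python
# Illustrative check of the Recurrence Theorem on a hypothetical chain.
# The matrix P is an assumption made for this sketch, not an example
# from the text. It is finite, connected and aperiodic.

P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]


def first_return_probs(P, j, n_max):
    """Return [f_j^(1), ..., f_j^(n_max)], where f_j^(n) = P(T_j = n)."""
    m = len(P)
    # g[i] holds P(T_ij = n) for the current n; for n = 1 it is p_ij.
    g = [P[i][j] for i in range(m)]
    probs = [g[j]]
    for _ in range(n_max - 1):
        # The first step goes to some k != j; j is then first reached
        # exactly n - 1 steps later, starting from k.
        g = [sum(P[i][k] * g[k] for k in range(m) if k != j) for i in range(m)]
        probs.append(g[j])
    return probs


def mean_recurrence_time(P, j, n_max=500):
    """Approximate E(T_j) = sum of n * f_j^(n), truncated at n_max terms."""
    f = first_return_probs(P, j, n_max)
    return sum(n * fn for n, fn in enumerate(f, start=1))


def n_step_matrix(P, n):
    """Return the n-step transition matrix P^n by repeated multiplication."""
    m = len(P)
    Q = [[1.0 if a == b else 0.0 for b in range(m)] for a in range(m)]
    for _ in range(n):
        Q = [[sum(Q[a][k] * P[k][b] for k in range(m)) for b in range(m)]
             for a in range(m)]
    return Q


if __name__ == "__main__":
    Pn = n_step_matrix(P, 100)
    for j in range(len(P)):
        column = [row[j] for row in Pn]          # p_ij^(100) for each start i
        print(j, column, 1.0 / mean_recurrence_time(P, j))
```

For each state j, the entries of column j of the 100-step matrix should agree with one another (the limit does not depend on the starting state i) and with the reciprocal of the approximate mean recurrence time.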



The Ehrenfest Diffusion Model 



We illustrate the Recurrence Theorem for a nontrivial
example. In this model the transition probabilities are

$$p_{ij} = \begin{cases} i/r & \text{if } j = i - 1 \\ 1 - i/r & \text{if } j = i + 1 \\ 0 & \text{otherwise.} \end{cases}$$

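We begin by computing the steady state distribution. We
must solve the system of equations

$$p_j = \sum_i p_i p_{ij} = \begin{cases} p_1/r, & \text{if } j = 0 \\ p_{r-1}/r, & \text{if } j = r \\ p_{j-1}\left(\frac{r-j+1}{r}\right) + p_{j+1}\left(\frac{j+1}{r}\right), & \text{otherwise.} \end{cases}$$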



The third equation gives us a recursive expression for $p_j$
in terms of $p_{j-1}$ and $p_{j-2}$:

$$p_j = p_{j-1}\left(\frac{r-j+1}{r}\right) + p_{j+1}\left(\frac{j+1}{r}\right),$$

so that

$$p_{j+1} = p_j\left(\frac{r}{j+1}\right) - p_{j-1}\left(\frac{r-j+1}{j+1}\right),$$

or, shifting indices:

$$p_j = p_{j-1}\left(\frac{r}{j}\right) - p_{j-2}\left(\frac{r-j+2}{j}\right).$$

Now using the fact that $p_0 = p_1/r$, we solve by setting
$p_0 = 1$ (the solution is only determined up to a scalar
multiple so it doesn't matter what we use for $p_0$) and by
applying the above recursion successively. This gives:



$$p_0 = 1$$

$$p_1 = r$$

$$p_2 = r \cdot \frac{r}{2} - 1 \cdot \frac{r}{2} = \frac{r(r-1)}{2}$$

$$p_3 = \frac{r(r-1)}{2} \cdot \frac{r}{3} - r \cdot \frac{r-1}{3} = \frac{r(r-1)(r-2)}{2 \cdot 3}$$

A pattern is clearly developing. It seems that $p_j = \binom{r}{j}$.
In fact, the formula

$$\binom{r}{j} = \binom{r}{j-1}\frac{r}{j} - \binom{r}{j-2}\frac{r-j+2}{j}$$

is just a combination of identities (1) and (5) in section
II.6.

Therefore the eigenvectors belonging to eigenvalue 1
of the transition matrix for this Markov chain have $j$-th
component $C\binom{r}{j}$, for any constant $C \neq 0$. For the steady
state distribution we must choose $C$ so that $\sum_j C\binom{r}{j} = 1$.
But we know that $\sum_j \binom{r}{j} = 2^r$ (identity (10) of section
II.5). Therefore $C = 2^{-r}$. The steady state distribution
therefore has

$$P(X = j) = \binom{r}{j} 2^{-r}.$$

This is none other than the binomial distribution for r 
tosses of a fair coin! In other words, it is as if we 
placed the particles of the Ehrenfest model into the two 
urns one at a time according to the toss of a fair coin. 
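
The computation above can be verified exactly for a small number of particles. The following Python sketch is an illustration added here, not part of the original text: the function names are hypothetical, the choice r = 6 is arbitrary, and rational arithmetic is used so that the comparison is exact. It applies the recursion $p_j = p_{j-1}(r/j) - p_{j-2}((r-j+2)/j)$ starting from $p_0 = 1$, $p_1 = r$, normalizes, and checks that the result is unchanged by one step of the chain.

```python
from fractions import Fraction
from math import comb


def ehrenfest_matrix(r):
    """Transition matrix on states 0..r: p(i, i-1) = i/r, p(i, i+1) = 1 - i/r."""
    P = [[Fraction(0)] * (r + 1) for _ in range(r + 1)]
    for i in range(r + 1):
        if i > 0:
            P[i][i - 1] = Fraction(i, r)
        if i < r:
            P[i][i + 1] = 1 - Fraction(i, r)
    return P


def unnormalized_solution(r):
    """Apply p_j = p_{j-1} * r/j - p_{j-2} * (r-j+2)/j from p_0 = 1, p_1 = r.

    Assumes r >= 1.
    """
    p = [Fraction(1), Fraction(r)]
    for j in range(2, r + 1):
        p.append(p[j - 1] * Fraction(r, j) - p[j - 2] * Fraction(r - j + 2, j))
    return p


def steady_state(r):
    """Normalize the recursive solution so that its components sum to 1."""
    p = unnormalized_solution(r)
    total = sum(p)
    return [x / total for x in p]


def step(dist, P):
    """One step of the chain: the row vector dist multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    r = 6  # arbitrary small example
    pi = steady_state(r)
    binomial = [Fraction(comb(r, j), 2 ** r) for j in range(r + 1)]
    print(pi == binomial)                        # recursion gives the binomial law
    print(step(pi, ehrenfest_matrix(r)) == pi)   # and it is stationary
```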

We know that the binomial distribution is closely
approximated by the normal distribution. Therefore $X$ is
very close to having the distribution $N(r/2, r/4)$
($p = q = 1/2$ and $n = r$). Therefore

$$\frac{X - r/2}{\sqrt{r/4}}$$

is approximately $N(0,1)$. Suppose that we use confidence
level 99.9%. Then

$$P\left(-3.3 < \frac{X - r/2}{\sqrt{r/4}} < 3.3\right) = .999$$

implies that $P(|X - r/2| < 1.65\sqrt{r}) = .999$. In an actual
experiment, $r$ will be of the order $10^{24}$. Consider the
case $r = 10^{24}$. If we replace the barrier between the
urns, we will find with probability .999 that

$$|X - 5 \times 10^{23}| < 1.65 \times 10^{12}.$$

Although $1.65 \times 10^{12}$ is a very large number, it is less
than $10^{-11}$ of the number of particles in either urn, so
that any departure of $X$ from being exactly $r/2$ would
be very difficult to detect.
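
The arithmetic behind this bound is easy to reproduce. The Python sketch below is an added illustration with a hypothetical function name. It relies on the normal approximation N(r/2, r/4) used above and takes the two-sided critical value from the standard library's `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def deviation_bound(r, confidence=0.999):
    """Half-width c with P(|X - r/2| < c) approximately `confidence`,
    assuming X is approximately N(r/2, r/4)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * sqrt(r / 4.0)


r = 10 ** 24
c = deviation_bound(r)
relative = c / (r / 2)  # bound as a fraction of the particles in one urn
print(c, relative)
```

The critical value computed this way is slightly below the rounded value 3.3 used in the text, so the half-width comes out marginally smaller than $1.65\sqrt{r}$.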

The value of $P(X = j)$ when $X$ has the steady state
distribution is $1/E(T_j)$ by the Recurrence Theorem. Again
using the normal approximation, we have that

$$E(T_j) \approx \sqrt{\frac{\pi r}{2}}\; e^{2(j - r/2)^2/r}.$$

If $j = r/2$, we get $E(T_j) \approx \sqrt{\pi r/2}$. When $r = 10^{24}$, this
is about $10^{12}$. This may seem to be very large, but this is
only because the unit of time we are using is very small.
On the other hand, if $j = 0$ and $r = 10^{24}$, then
$E(T_0) = 1/P(X = 0) = 2^r = 2^{10^{24}} > 10^{10^{23}}$. Even if our
time scale is as small as $10^{-50}$ sec, this waiting time
dwarfs even the waiting time mentioned in section IV.7
(on the writing of Hamlet at random by a monkey). Clearly
this state does not occur very often.
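
These orders of magnitude can be checked with a few lines of Python. The sketch below is an added illustration with hypothetical function names. It compares the exact value $2^r / \binom{r}{j}$ with the normal approximation for a moderate r (the value 1000 is an arbitrary choice), and for $r = 10^{24}$ it works only with logarithms, since $2^r$ itself is far too large to compute.

```python
from math import comb, exp, log10, pi, sqrt


def exact_mean_recurrence(r, j):
    """E(T_j) = 1 / P(X = j) = 2^r / C(r, j) under the steady state distribution."""
    return 2 ** r / comb(r, j)


def normal_mean_recurrence(r, j):
    """Normal approximation sqrt(pi r / 2) * exp(2 (j - r/2)^2 / r)."""
    return sqrt(pi * r / 2.0) * exp(2.0 * (j - r / 2.0) ** 2 / r)


if __name__ == "__main__":
    # Exact value against the approximation at the center j = r/2.
    print(exact_mean_recurrence(1000, 500), normal_mean_recurrence(1000, 500))

    r = 10 ** 24
    # E(T_{r/2}), measured in transitions of the chain.
    print(normal_mean_recurrence(r, r // 2))
    # Base-10 exponent of E(T_0) = 2^r.
    print(r * log10(2))
```

The approximation is only used near the center here. At j = 0 the text uses the exact value $2^r$, and the sketch does the same.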

6. Exercises for Chapter VIII: Markov Chains

1. Compute the probability of the gambler's ruin for a gambler 
having initial fortune $500 and upper limit on winnings $1000, 
who is playing roulette and who is making bets of $10 on red or 
black; do the same for bets of $100. What advice would you give 
to the gambler? to the gambling casino? 

2. In doing homework problems each success improves the chance 
of another success, while each failure tends to increase the 
chance of subsequent failure. Build a Markov chain model for 
this.

3. Consider the following model of the spread of disease. There
are N persons in the population. Some are sick and the rest
are not.

(a) when a sick person meets a healthy one, the healthy one
becomes sick with probability $\alpha$,

(b) all encounters are between pairs of persons,

(c) all possible encounters in pairs are equally likely,

(d) one such encounter occurs per unit time,

(e) during each unit of time each sick person recovers
with probability $\beta$ independently of (a) - (d) and
of the previous time spent sick.

Let $X_n$ be the number of sick persons at time $n$. Write the
transition matrix for this Markov chain, and draw its graph.

4*. Alter the genetics model in section 3 to include a genetic
advantage for one of the genotypes (say, for the Aa genotype).
How would you include a preference by females having certain
genotypes for males having certain other genotypes? Assume that
the genotypes AA and Aa are indistinguishable from one
another.

5*. Prove that in a finite random walk without absorbing barriers,
all states are persistent.



6. A man has two girl friends A and B, one living uptown and
one living downtown, respectively. He either visits one of his
girl friends on a given evening, or he stays at home. The day
after an evening at home he goes to the bus stop at a random time
and takes whichever bus comes first, the bus uptown or the bus
downtown, visiting A or B, respectively. The buses run in
both directions during every 15 minute interval, on a fixed
schedule. The man is not too compatible with A, for after a
visit to her, he stays home the next evening with probability
9/10 and visits her again with probability 1/10. On the other
hand he is quite compatible with B; after a visit to her,
he visits her again the next evening with probability 9/10 and stays
home with probability 1/10. Set up the Markov chain for this
process. Much to the man's surprise, he spends as many evenings
with A as with B on the average. Compute how frequently
he spends his evening at home, on the average. See exercise li.lt-

7. Compute the steady state distribution of the genetics model
in section 3. Notice that it is not in general the same as the
distribution of genotypes in the larger population.

8*. Compute the steady state distribution for the more general 
genetics models in exercise 4. 

9. Compute the steady state of the symmetric finite Markov chain 
with reflecting barriers. How often does "Daddy" advance you a 
loan on the average? 

10. In exercise III.2 the San Francisco bar in question is 100
yards uphill from the Bay, but the drunk's home is only 10 yards 
uphill from the bar. How probable is it that the drunk falls 
into San Francisco Bay before finding his way home? 

11. In a chemical solution there are initially N molecules,
each being one of types A, B, C or D. During every unit of
time exactly one collision occurs between a pair of these
molecules, all possible collisions being equally likely. During
such a collision nothing happens unless the colliding molecules
are A and B or are C and D. If A and B collide, there
is a probability $\alpha$ that they react and become a pair of C
and D molecules. If C and D collide, there is a
probability $\beta$ that they become A and B. In chemical symbols

$$A + B \underset{\beta}{\overset{\alpha}{\rightleftharpoons}} C + D.$$

The state of this system is totally determined by the number of
A molecules. Let the number of A molecules at time $n$ be $X_n$.
What is the transition matrix for this Markov chain? What is the
steady state number of molecules of each kind?

12. Same as exercise 11 above, but for the autocatalytic
reaction

$$A + A \underset{\beta}{\overset{\alpha}{\rightleftharpoons}} B + A.$$

13*. Generalize exercises 11 and 12 to an arbitrary
number of initial molecules of each kind. Can the reaction rate
constants be determined from the steady state distribution?

    properties of 4.7
Standardization 4.11f, 4.20
Starting point 8.7
State (Physics) 2.3
    of a Markov chain 8.1
Statistical tests, see Tests, statistical
Steady state 8.35ff
Stirling's formula 2.8, 2.35
Stochastic process 1.20
    matrix 8.4
    vector 8.4
Stroke 1.23
Subevent 1.4, 1.15
Sub-multiset 1.24
Subset 1.^, 2.9, 2.20f
Sufficient statistic 7.4f, 7.19
Sum of multisets 1.12
Symmetric difference 1.23
    random walk 3.17, 8.8
System (Physics) 2.3



Tail event 4.43
Taylor's formula 6.37
Tests, statistical 4.26ff, 4.49ff, 4.61ff
    single-tail 4.33
    double-tail 4.33
Time 2.1, 1.20
Total certainty 7.6
    uncertainty, see Randomness, complete or total
Transience 8.30
Transition 8.1
    probability 8.2
    probability matrix 8.4
Truncation of a random variable 7.22



Uncertainty
Uniform distribution 3.40, 3.62
    process 3.41, 3.50f, 3.62f, 4.9f, 6.10ff, 6.33ff, 6.55
Union 1.3f, 1.16
Urn 2.2



Variance 4.2, 4.47f
    addition theorem 4.6
    basic properties 4.7
    tables of 4.11, 6.54
Venn diagrams 1.4ff



Waiting times 3.4, 5.12, 6.1f
    Bernoulli process 3.4ff, 3.61f, 6.1
    Poisson process 6.2, 6.32f, 6.49
    Markov chains 8.38
Word 2.2



Zero-One Law 4.43